Loading the Document
Load a PDF research paper into the system.

from langchain.document_loaders import PyPDFLoader

# Load the PDF
loader = PyPDFLoader("research_paper.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages from the document.")
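Each item returned by PyPDFLoader is a LangChain Document, with the page text in page_content and source information in metadata. A quick optional check that the PDF parsed correctly (the 200-character slice is arbitrary):

# Peek at the first page
first_page = documents[0]
print(first_page.page_content[:200])  # first 200 characters of page text
print(first_page.metadata)            # e.g. {'source': 'research_paper.pdf', 'page': 0}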
Splitting the Document
Divide the document into smaller chunks for processing.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
print(f"Split document into {len(chunks)} chunks.")
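Here chunk_size and chunk_overlap are measured in characters; the overlap preserves context that would otherwise be cut off at chunk boundaries. A small optional sanity check on the result:

# Confirm the chunks stay close to the configured size
lengths = [len(chunk.page_content) for chunk in chunks]
print(f"Average chunk length: {sum(lengths) / len(lengths):.0f} characters")
print(f"Longest chunk: {max(lengths)} characters")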
Creating Embeddings
Convert the chunks into vector embeddings.

from langchain.embeddings import OpenAIEmbeddings

# Generate embeddings for the text of each chunk
embeddings = OpenAIEmbeddings()
chunk_embeddings = embeddings.embed_documents([chunk.page_content for chunk in chunks])
print("Embeddings created for all chunks.")
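Each chunk is mapped to a fixed-length numeric vector (1,536 dimensions for OpenAI's text-embedding-ada-002, the default embedding model at the time of writing). The explicit embed_documents call above is mainly for illustration, since the vector store in the next step computes embeddings itself. To inspect the result:

# Check how many vectors were produced and their dimensionality
print(f"Number of vectors: {len(chunk_embeddings)}")
print(f"Vector dimensionality: {len(chunk_embeddings[0])}")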
Storing in a Vector Store
Store these embeddings in a vector database for retrieval.

from langchain.vectorstores import FAISS

# Save embeddings in FAISS
vector_store = FAISS.from_documents(chunks, embeddings)
print("Embeddings stored in FAISS vector database.")
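Once the index is built, you can query it directly; this is a convenient way to spot-check retrieval quality before involving an LLM. A minimal sketch (the query string is just an example):

# Retrieve the three chunks most similar to a test query
results = vector_store.similarity_search("main findings of the research", k=3)
for i, doc in enumerate(results, start=1):
    print(f"Result {i}: {doc.page_content[:150]}")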
Accepting a User's Question
Take a question from the user.

user_query = "What are the main findings of the research?"
Finding Relevant Content
Use the vector database to retrieve the most relevant chunks for the query.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

# Set up a retriever that returns the top 3 most similar chunks
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Use a RetrievalQA chain to combine retrieval with an LLM
llm = OpenAI(model="text-davinci-003")
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
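Before generating an answer, it can help to inspect exactly which chunks the retriever selects for the question; this makes poor answers much easier to debug. An optional check:

# See which chunks the retriever returns for the user's question
relevant_docs = retriever.get_relevant_documents(user_query)
for doc in relevant_docs:
    print(doc.metadata.get("page"), doc.page_content[:100])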
Generating an Answer
Generate the answer based on the retrieved content.

answer = qa_chain.run(user_query)
print(f"Answer: {answer}")
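RetrievalQA can also return the chunks it used, which makes it possible to point back to the pages of the paper an answer came from. A sketch of that variant, reusing the llm and retriever defined above:

# Build the chain so it also returns its supporting chunks
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
)
result = qa_with_sources({"query": user_query})
print(result["result"])                  # the generated answer
for doc in result["source_documents"]:   # the chunks the answer was based on
    print(doc.metadata.get("page"), doc.page_content[:100])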
- Loading: The PDF is loaded into the system as documents.
- Splitting: The PDF is split into manageable chunks with overlapping sections for context.
- Embedding: The chunks are embedded into high-dimensional vectors.
- Storage: The embeddings are stored in FAISS, making them searchable.
- Retrieval: When a user asks a question, the system retrieves the most relevant chunks.
- Answer Generation: An LLM generates an answer based on the retrieved content.
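Taken together, the whole pipeline fits in one small helper function. This is only a sketch that combines the snippets above into a single call:

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

def answer_question(pdf_path: str, question: str) -> str:
    """Load a PDF, index it in FAISS, and answer a question about it."""
    documents = PyPDFLoader(pdf_path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_documents(documents)
    vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    qa_chain = RetrievalQA.from_chain_type(llm=OpenAI(model="text-davinci-003"), retriever=retriever)
    return qa_chain.run(question)

print(answer_question("research_paper.pdf", "What are the main findings of the research?"))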
User Question: "What are the main findings of the research?"
System Answer: "The research highlights that X intervention improves Y outcomes by Z% according to the data in Table 2."
This demonstrates how LangChain simplifies both the ingestion phase (loading, splitting, embedding, storing) and the generation phase (retrieval and answering) of building a scalable, efficient Q&A system.