Loading the Document
Load a PDF research paper into the system.

    from langchain.document_loaders import PyPDFLoader

    # Load the PDF
    loader = PyPDFLoader("research_paper.pdf")
    documents = loader.load()
    print(f"Loaded {len(documents)} pages from the document.")
Splitting the Document
Divide the document into smaller chunks for processing.

    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Split the document into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = text_splitter.split_documents(documents)
    print(f"Split document into {len(chunks)} chunks.")
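To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width character chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (the real splitter prefers to break on separators such as paragraphs and newlines). The function name `split_with_overlap` and the sample text are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters, where each window
    repeats the last chunk_overlap characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical 1200-character sample text
sample = "".join(str(i % 10) for i in range(1200))
pieces = split_with_overlap(sample, chunk_size=500, chunk_overlap=50)
print([len(p) for p in pieces])  # [500, 500, 300]
```

With a 50-character overlap, the end of one chunk reappears at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.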
Creating Embeddings
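Convert the chunks into vector embeddings.

    from langchain.embeddings import OpenAIEmbeddings

    # Generate embeddings (requires the OPENAI_API_KEY environment variable)
    embeddings = OpenAIEmbeddings()
    # embed_documents expects a list of strings, so pass each chunk's text
    chunk_embeddings = embeddings.embed_documents([chunk.page_content for chunk in chunks])
    print("Embeddings created for all chunks.")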
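Embeddings are useful because related texts map to nearby vectors, and closeness is commonly measured with cosine similarity. The sketch below illustrates only that comparison. `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and the vocabulary and sentences are made up for the example; it captures shared words rather than meaning, so it only hints at what a learned model does.

```python
import math
from collections import Counter
from typing import List

# Assumed toy vocabulary; a real embedding model has no such fixed word list
VOCAB = ["model", "training", "data", "results", "patients", "treatment"]


def toy_embed(text: str) -> List[float]:
    """Map text to a vector of word counts over VOCAB (illustration only)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = toy_embed("treatment results for patients")
sim_related = cosine_similarity(query, toy_embed("patients improved after treatment"))
sim_unrelated = cosine_similarity(query, toy_embed("model training data"))
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The sentence that shares words with the query scores higher than the one that shares none, which is the property retrieval relies on.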
Storing in a Vector Store
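Store the chunks and their embeddings in a vector database for retrieval. Note that `FAISS.from_documents` computes the embeddings itself from the `embeddings` object, so the `chunk_embeddings` list from the previous step is not needed here.

    from langchain.vectorstores import FAISS

    # Embed the chunks and save them in a FAISS index
    vector_store = FAISS.from_documents(chunks, embeddings)
    print("Embeddings stored in FAISS vector database.")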
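Conceptually, a vector store keeps each chunk alongside its vector and answers a query by finding the stored vectors closest to the query vector. The class below is a minimal in-memory sketch of that idea using exhaustive Euclidean-distance search. `TinyVectorStore` and its methods are hypothetical names, not part of LangChain or FAISS, and the 2-D vectors and chunk labels are made-up stand-ins for real embeddings.

```python
import math
from typing import List, Tuple


class TinyVectorStore:
    """Brute-force nearest-neighbor store (illustration only)."""

    def __init__(self) -> None:
        self._items: List[Tuple[List[float], str]] = []

    def add(self, vector: List[float], text: str) -> None:
        self._items.append((vector, text))

    def search(self, query: List[float], k: int = 3) -> List[Tuple[float, str]]:
        """Return the k stored texts with the smallest distance to query."""
        scored = [(math.dist(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0])  # smallest distance first
        return scored[:k]


store = TinyVectorStore()
store.add([0.0, 1.0], "chunk A")
store.add([1.0, 0.0], "chunk B")
store.add([0.9, 0.1], "chunk C")
store.add([0.1, 0.9], "chunk D")

hits = store.search([1.0, 0.05], k=2)
print([text for _, text in hits])  # ['chunk B', 'chunk C']
```

Comparing the query against every stored vector is fine for a handful of chunks; libraries such as FAISS exist to make the same kind of lookup fast over very large collections.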
Accepting a User's Question
Take a question from the user.

    user_query = "What are the main findings of the research?"
Finding Relevant Content
Use the vector database to retrieve the most relevant chunks for the query.

    from langchain.chains import RetrievalQA
    from langchain.llms import OpenAI

    # Set up a retriever that returns the 3 most similar chunks
    retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 3})

    # Build a QA chain that passes the retrieved chunks to the LLM.
    # Note: text-davinci-003 has since been retired by OpenAI; substitute a model that is currently available.
    llm = OpenAI(model="text-davinci-003")
    qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
Generating an Answer
Generate the answer based on the retrieved content.

    answer = qa_chain.run(user_query)
    print(f"Answer: {answer}")
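Answer generation in a retrieval pipeline generally works by placing the retrieved chunks into the prompt next to the question, so the model answers from that context. The sketch below shows one plausible way to assemble such a prompt. `build_prompt`, the template wording, and the sample passages are all hypothetical; this is not LangChain's actual prompt.

```python
from typing import List


def build_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved chunks and the question into a single prompt string."""
    # Number each chunk so the model (and the reader) can tell them apart
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages standing in for the chunks a retriever would return
sample_chunks = ["  First retrieved passage. ", "Second retrieved passage."]
prompt = build_prompt("What are the main findings of the research?", sample_chunks)
print(prompt)
```

Because every retrieved chunk is pasted into one prompt, the value of `k` and the chunk size together determine how much of the model's context window the retrieved text consumes.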
- Loading: The PDF is loaded into the system as a list of page-level documents.
- Splitting: The document is split into manageable chunks with overlapping sections for context.
- Embedding: The chunks are embedded into high-dimensional vectors.
- Storage: The embeddings are stored in FAISS, making them searchable.
- Retrieval: When a user asks a question, the system retrieves the most relevant chunks.
- Answer Generation: An LLM generates an answer based on the retrieved content.
User Question: "What are the main findings of the research?"
System Answer: "The research highlights that X intervention improves Y outcomes by Z% according to the data in Table 2."
This demonstrates how LangChain simplifies the ingestion and generation phases of building a scalable, efficient Q&A system.