This project uses HuggingFace's pretrained LLM meta-llama/Llama-2-7b-chat-hf, fine-tuned with PDF data, to generate accurate responses to queries.
save_model.py creates and saves the model locally. The code for the model was adapted from this article: Build LLM Chatbot With PDF Documents. I ran the model on Google Colab using an L4 GPU. The Jupyter Notebook can be accessed here: Google Colab Notebook.
run_model.py loads the model, tokenizer, embedding model, and vector index to generate an example response from the LLM chatbot. It then takes the given question and returns the model's response to the user.
You will need to create a .env file with your own HuggingFace Access Token to run the project.
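As a rough illustration of the .env setup, the sketch below reads a file of KEY=VALUE lines using only the standard library. The project lists python-dotenv as a dependency, which normally handles this step; the variable name HF_TOKEN and the helpers read_env_file and get_hf_token are assumptions made for this example, not names taken from the project.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env", var_name="HF_TOKEN"):
    """Return the token from the environment, falling back to the .env file.

    HF_TOKEN is a hypothetical variable name; use whatever the scripts expect.
    """
    token = os.environ.get(var_name) or read_env_file(path).get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} not found in the environment or in {path}")
    return token
```

Keeping the token in .env (and out of version control) means the access token never has to appear in the scripts themselves.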
The original article reran the vector embedding and model creation every time the notebook was run. I changed the project to save the model locally, so you can load it later and run queries directly.
A more detailed explanation can be found here: models/README.md
Running save_model.py will save the model locally. I'll briefly go over how this works:
- Using pypdf, all .pdf documents in the papers/ folder are read and loaded into memory.
- Authenticates and sets up the environment for using Hugging Face models (Hugging Face Login).
- Loads the sentence-transformers/multi-qa-MiniLM-L6-cos-v1 embedding model through the HuggingFaceEmbeddings wrapper from the langchain library to create the vector embedding model.
- Creates a Vector Store Index (an index that stores numerical representations of the data that capture their semantic meaning) from the given PDF documents and the vector embedding model.
- Initializes a HuggingFaceLLM from meta-llama/Llama-2-7b-chat-hf.
- The system_prompt, query_wrapper_prompt, context_window, and other important fields are set.
- Configures global settings (Settings) for the embedding model, LLM, and chunk size to optimize performance during query operations.
- The function save_model_components() saves all important files of the LLM to models/:
  - Vector index (index.pkl)
  - Embedding model (embedding_model.pkl if using CUDA, otherwise embedding_model_cpu.pkl)
  - LLM model and tokenizer (llm_model/ and llm_tokenizer/)
  - LLM configuration (llm_config.json)
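The file layout listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual save_model_components(): the signature and the helpers embedding_filename and load_pickled are assumptions, plain Python objects stand in for the real index and embedding model, and the llm_model/ and llm_tokenizer/ directories are left out because they hold the model weights and tokenizer files rather than pickles.

```python
import json
import pickle
from pathlib import Path


def embedding_filename(use_cuda):
    """Pick the embedding file name depending on the device, as listed above."""
    return "embedding_model.pkl" if use_cuda else "embedding_model_cpu.pkl"


def save_model_components(index, embedding_model, llm_config, use_cuda, out_dir="models"):
    """Write the picklable components and the JSON config under out_dir.

    Assumed signature, for illustration only. The LLM weights and tokenizer
    (llm_model/ and llm_tokenizer/) are not handled here.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "index.pkl", "wb") as f:
        pickle.dump(index, f)
    with open(out / embedding_filename(use_cuda), "wb") as f:
        pickle.dump(embedding_model, f)
    with open(out / "llm_config.json", "w") as f:
        json.dump(llm_config, f, indent=2)
    # Return the names of the files now present in out_dir.
    return sorted(p.name for p in out.iterdir())


def load_pickled(path):
    """Load one pickled component back; only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

Splitting the embedding file by device avoids loading a CUDA-pickled object on a machine without a GPU, which is presumably why two file names exist.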
I used Django and React to create a simple web application as the interface for the chatbot. Users can type their question in the textbox and receive a response from the model, shown when they click send.
I added a script in manage.py to automatically build the React frontend with Vite before starting the Django server, so you don't need to run npm run build every time.
python manage.py runserver
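One way such a build step could look is sketched below. It is not the project's actual manage.py: the frontend directory name and the functions should_build, build_frontend, and main are hypothetical, and the runner argument exists only so the sketch can be exercised without npm installed.

```python
import subprocess
from pathlib import Path


def should_build(argv):
    """Build only when the runserver command is requested."""
    return len(argv) > 1 and argv[1] == "runserver"


def build_frontend(frontend_dir, runner=subprocess.run):
    """Run the Vite build through npm inside the frontend directory."""
    # check=True raises CalledProcessError if the build fails,
    # so the server does not start with a stale bundle.
    return runner(["npm", "run", "build"], cwd=str(frontend_dir), check=True)


def main(argv, frontend_dir=Path("frontend"), runner=subprocess.run):
    """Build the frontend when needed, then hand over to Django."""
    if should_build(argv):
        build_frontend(frontend_dir, runner=runner)
    # In a real manage.py, Django's execute_from_command_line(argv) follows here.
```

Django's autoreloader re-executes manage.py in a child process, so a real script may also want a guard against building twice per start.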
I tested the LLM's responses to some sample questions relevant to the PDF papers used in fine-tuning.
I ran the model on both my MacBook Air and Google Colab. Running it on my computer takes significantly longer than running it on Google Colab.
The detailed report of the runtimes and responses from the LLM can be found here: reports/README.md
I implemented RAG (retrieval-augmented generation) to improve the accuracy of the responses. RAG consults an authoritative knowledge base outside the model's training data before generating a response.
In this case, it retrieves the most relevant section of a research paper in papers/ and adds it as context in the prompt.
The code for RAG is implemented in rag.py, and a detailed report of responses using RAG can be found in reports/RAG/README.md
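The retrieve-then-prompt idea can be sketched in a few lines. This is a toy illustration rather than the code in rag.py: a bag-of-words count vector stands in for the real embedding model, and the function names and prompt template are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks, top_k=1):
    """Prepend the retrieved context to the question (hypothetical template)."""
    context = "\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

With a real embedding model the ranking step stays the same; only embed() changes, which is why the retrieved section tracks meaning rather than exact word overlap.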
langchain, llama-index, transformers, torch, pypdf, python-dotenv, einops, accelerate, bitsandbytes, sentence_transformers, sentencepiece, Django
pip install --no-cache-dir -r requirements.txt