Use a different text splitter to improve results. Ingest takes an argument pointing to the doc to ingest.
imartinez committed May 5, 2023
1 parent a05186b commit 92244a9
Showing 2 changed files with 6 additions and 6 deletions.
5 changes: 2 additions & 3 deletions README.md
@@ -20,13 +20,12 @@ This repo uses a [state of the union transcript](https://github.com/imartinez/pr

 ## Instructions for ingesting your own dataset

-Place your .txt file in `source_documents` folder.
-Edit `ingest.py` loader to point it to your document.
+Get your .txt file ready.

 Run the following command to ingest the data.

 ```shell
-python ingest.py
+python ingest.py <path_to_your_txt_file>
 ```

 It will create a `db` folder containing the local vectorstore. Will take time, depending on the size of your document.
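
Note on the new invocation: `ingest.py` in this commit reads `argv[1]` directly, so running it without an argument would raise `IndexError`. A minimal sketch of a guard follows; `parse_source_path` is a hypothetical helper for illustration and is not part of the commit.

```python
from pathlib import Path


def parse_source_path(args):
    """Return the document path from an argv-style list.

    Hypothetical helper, not part of the commit. Exits with a usage
    message when the argument is missing or does not name a file.
    """
    if len(args) != 2:
        raise SystemExit("usage: python ingest.py <path_to_your_txt_file>")
    path = Path(args[1])
    if not path.is_file():
        raise SystemExit(f"error: {path} is not a file")
    return path
```

Under this assumption, `main()` could call `parse_source_path(sys.argv)` and pass the result (as a string) to the loader instead of indexing `argv` directly.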
7 changes: 4 additions & 3 deletions ingest.py
@@ -1,13 +1,14 @@
 from langchain.document_loaders import TextLoader
-from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.text_splitter import CharacterTextSplitter
 from langchain.vectorstores import Chroma
 from langchain.embeddings import LlamaCppEmbeddings
+from sys import argv

 def main():
     # Load document and split in chunks
-    loader = TextLoader('./source_documents/state_of_the_union.txt', encoding='utf8')
+    loader = TextLoader(argv[1], encoding="utf8")
     documents = loader.load()
-    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
+    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
     texts = text_splitter.split_documents(documents)
     # Create embeddings
     llama = LlamaCppEmbeddings(model_path="./models/ggml-model-q4_0.bin")
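
Note on the splitter settings: the diff contains both `chunk_size=1000, chunk_overlap=0` and `chunk_size=500, chunk_overlap=50`. The sketch below illustrates what a size and an overlap mean, using a simplified fixed-width sliding window. It is not LangChain's implementation (as far as I know, `CharacterTextSplitter` first splits on a separator and then merges the pieces up to the size limit), and `sliding_chunks` is a hypothetical name used only here.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-width chunks that share chunk_overlap characters.

    Simplified illustration only; not the algorithm LangChain uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks
```

With a non-zero overlap, the tail of each chunk is repeated at the head of the next, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.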
