A public and free repository of embeddings datasets. We're going to embed the internet.
Here's how you can help:
A good rule of thumb: the dataset should be significant enough that many people will use it, shouldn't be prohibitively large to embed or too expensive/difficult to obtain, and you should be able to imagine at least one immediate tool you could build with the data.
We'll need to remove anything that doesn't need to be embedded: weird characters, page numbers that add no information to the text (chapter titles are fine), tables of contents, unneeded footnotes, boilerplate text, etc.
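For example, a minimal cleanup sketch in Python (the patterns and the `clean_text` helper are illustrative only and will need to be adapted per dataset):

```python
import re

def clean_text(raw: str) -> str:
    """Hypothetical cleanup pass; tune the patterns for each dataset."""
    text = raw
    # Drop standalone page numbers (lines that contain only digits).
    text = re.sub(r"(?m)^\s*\d+\s*$", "", text)
    # Strip control characters and other non-printable debris.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse the runs of blank lines left behind by the removals above.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```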
We'll need to split the dataset into reasonable chunks, which may vary in size. Depending on the model we use (which might change with subject matter, dataset size, intended use case, etc.), every chunk will have to stay under some token limit. If the dataset isn't gigantic (and won't need to be updated over time), it's safe to assume that anything under 8,000 tokens will work.
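As a sanity check, here's a small sketch of how a chunk's token count could be verified. It assumes the `tiktoken` library and its `cl100k_base` encoding as a stand-in for whatever model we actually end up using:

```python
import tiktoken

MAX_TOKENS = 8000  # conservative ceiling from the guideline above

def count_tokens(chunk: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with a tiktoken encoding; the real model/encoding may differ."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(chunk))

def within_limit(chunk: str) -> bool:
    """True if the chunk fits under the assumed token ceiling."""
    return count_tokens(chunk) <= MAX_TOKENS
```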
How long should chunks be, and how should the text be split? There's no single answer: it changes per dataset and per use case. For example, if you were embedding the King James Bible, you might embed every verse individually, chunk the text into 5-verse chunks that each overlap by 2 verses, or embed every chapter as a whole. This is entirely up to your best judgment; it might be one of these, or it might be all three!
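As one illustration of the middle option, here's a sketch of a 5-verse sliding window with a 2-verse overlap. It assumes the text has already been split into a list of verse strings; the function name and parameters are just examples:

```python
from typing import List

def chunk_verses(verses: List[str], window: int = 5, overlap: int = 2) -> List[str]:
    """Sliding-window chunks: each chunk holds `window` verses and
    shares `overlap` verses with the chunk that follows it."""
    step = window - overlap
    chunks = []
    for start in range(0, len(verses), step):
        chunks.append(" ".join(verses[start:start + window]))
        if start + window >= len(verses):
            break  # the last window already reaches the end of the text
    return chunks
```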
Once all of these steps are done, feel free to mention me or any other team member. We'll handle the embedding, storage, and hosting.
If you have more questions, join our Discord server: https://discord.gg/zWa4Ay5MPb.
- US Case Law
- Clinical Trials
- SEC Filings
- US Patents
- Anything you think of!
- All Hindu texts
- None yet
- Coming soon
- King James Bible
- Deuterocanonical books
- Apocrypha