Skip to content

Commit

Permalink
update available extractors with descriptions (tensorlakeai#289)
Browse files Browse the repository at this point in the history
  • Loading branch information
lucasjacks0n authored Feb 1, 2024
1 parent 9026b1b commit 744e40e
Showing 1 changed file with 47 additions and 11 deletions.
58 changes: 47 additions & 11 deletions docs/docs/apis/available_extractors.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,20 +3,56 @@
Indexify comes with some pre-built extractors that you can download and deploy with your installation. We keep these extractors always updated with new releases of models and our internal code.

## Embedding Extractors
* ColBERT
* E5
* Hash
* Jina
* Sentence Transformers - MiniLML6, MPnet
* Scibert
#### [ColBERT](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/colbert)
This ColBERTv2-based extractor is a Python class that encapsulates the functionality to convert text inputs into vector embeddings using the ColBERTv2 model. It leverages ColBERTv2's transformer-based architecture to generate context-aware embeddings suitable for various natural language processing tasks.


#### [E5](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/e5_embedding)
A good small and fast general model for similarity search or downstream enrichments.
Based on [E5_Small_V2](https://huggingface.co/intfloat/e5-small-v2) which only works for English texts. Long texts will be truncated to at most 512 tokens.


#### [Hash](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/hash-embedding)
This extractor extractors an "identity-"embedding for a piece of text, or file. It uses the sha256 to calculate the unique embeding for a given text, or file. This can be used to quickly search for duplicates within a large set of data.


#### [Jina](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/jina_base_en)
This extractor extractors an embedding for a piece of text.
It uses the huggingface [Jina model](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) which is an English, monolingual embedding model supporting 8192 sequence length. It is based on a Bert architecture (JinaBert) that supports the symmetric bidirectional variant of ALiBi to allow longer sequence length.

#### [MiniLML6 - Sentence Transformer](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/minilm-l6)
This extractor extractors an embedding for a piece of text.
It uses the huggingface [MiniLM-6 model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2), which is a tiny but very robust emebdding model for text.

#### [MPnet - Sentence Transformer](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/mpnet)
This is a sentence embedding extractor based on the [MPNET Multilingual Base V2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2).
This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
It's best use case is paraphrasing, but it can also be used for other tasks.

#### [OpenAI Embedding](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/openai-embedding)
This extractor extracts an embedding for a piece of text.
It uses the OpenAI text-embedding-ada-002 model.

#### [SciBERT Uncased](https://github.com/tensorlakeai/indexify-extractors/tree/main/embedding-extractors/scibert)
This is the pretrained model presented in [SciBERT: A Pretrained Language Model for Scientific Text](https://www.aclweb.org/anthology/D19-1371/), which is a BERT model trained on scientific text.
Works best with scientific text embedding extraction.


## Audio
* Whisper
#### [Whisper](https://github.com/tensorlakeai/indexify-extractors/tree/main/whisper-asr)
This extractor converts extracts transcriptions from audio. The entire text and
chunks with timestamps are represented as metadata of the content.

## Invoice
* Chord
* Donut
* Nougat
#### [Donut CORD](https://github.com/tensorlakeai/indexify-extractors/tree/main/invoice-extractor/donut_cord)
This extractor parses pdf or image form of invoice which is provided in JSON format. It uses the pre-trained [donut cord fine-tune model from huggingface](https://huggingface.co/naver-clova-ix/donut-base-finetuned-cord-v2).
This model is specially good at extracting list of product and its price from invoice.

#### [Donut Invoice](https://github.com/tensorlakeai/indexify-extractors/tree/main/invoice-extractor/donut_invoice)
This extractor parses some invoice-related data from a PDF.
It uses the pre-trained [donut model from huggingface](https://huggingface.co/docs/transformers/model_doc/donut).

## Web Extractors
* Wikipedia

#### [Wikipedia](https://github.com/tensorlakeai/indexify-extractors/tree/main/web-extractors/wikipedia)
Extract text content from wikipedia html pages.

0 comments on commit 744e40e

Please sign in to comment.