update docs

diptanu committed Jul 4, 2024
1 parent 9300fe2 commit 1f6b3a4

Showing 5 changed files with 57 additions and 26 deletions.
62 changes: 44 additions & 18 deletions docs/docs/index.md

Indexify is a data framework designed for building ingestion and extraction pipelines for unstructured data. These pipelines are defined using declarative configuration. Each stage of the pipeline can perform structured extraction using any AI model or transform ingested data. The pipelines start working immediately upon data ingestion into Indexify, making them ideal for interactive applications and low-latency use cases.

## How It Works

##### Setup Ingestion Pipelines

Indexify provides a declarative configuration approach. You define an ingestion pipeline like this:
```yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
  # ... (remaining policies are collapsed in this diff view)
```

3. We provide a set of ready-to-use extractors, and you can easily write custom extractors to wrap any data extraction or transformation code or library; a sketch of a complete graph follows below.
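
The rest of the graph is collapsed in this diff view. For orientation, a complete graph might look like the sketch below, assuming the `content_source` field is what chains one policy's output into the next. The policy names (`pdf_to_markdown`, `entity_extractor`, `embedding`) match the retrieval examples later on this page; the extractor module names are placeholders, not real extractors:

```yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
  - extractor: 'yourorg/pdf-to-markdown'    # placeholder extractor name
    name: 'pdf_to_markdown'
  - extractor: 'yourorg/entity-extractor'   # placeholder extractor name
    name: 'entity_extractor'
    content_source: 'pdf_to_markdown'
  - extractor: 'yourorg/embeddings'         # placeholder extractor name
    name: 'embedding'
    content_source: 'pdf_to_markdown'
```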

##### Upload Data
```python
from indexify import IndexifyClient

client = IndexifyClient()
content_id = client.upload_file("pdf-ingestion-pipeline", "file.pdf")
```

##### Retrieve
Retrieve the data extracted by each extraction policy for the uploaded document.
```python
markdown = client.get_extracted_content(content_id, "pdf-ingestion-pipeline", "pdf_to_markdown")
named_entities = client.get_extracted_content(content_id, "pdf-ingestion-pipeline", "entity_extractor")
```

Embeddings are automatically written into the configured vector database (default: LanceDB). You can search the indexes like this:
```python
results = client.search("pdf-ingestion-pipeline.embedding.embedding", "Who won the 2017 NBA finals?", k=3)
```

## Multi-Modal
Indexify can parse PDFs, videos, images, and audio. You can use any model under the sun to extract data in the pipelines. We have written several extractors ourselves and continue to add more. You can write a new extractor that wraps any local model or API in under five minutes, as in the sketch below.
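
As a rough illustration, here is a minimal custom extractor. The `indexify_extractor_sdk` module, the `Extractor`/`Content` classes, and the `extract` signature are assumptions about the SDK interface, and the extractor itself is hypothetical; consult the extractor SDK docs for the exact API:

```python
from typing import List

# Assumed SDK imports; verify against the installed indexify_extractor_sdk.
from indexify_extractor_sdk import Content, Extractor


def call_your_model_or_api(text: str) -> str:
    # Placeholder for any local model or hosted API call.
    return text.split("Total:")[-1].strip()


class InvoiceTotalExtractor(Extractor):
    name = "yourorg/invoice-total"            # hypothetical extractor name
    description = "Extracts the total amount from invoice text"
    input_mime_types = ["text/plain"]

    def extract(self, content: Content, params: dict = None) -> List[Content]:
        text = content.data.decode("utf-8")
        total = call_your_model_or_api(text)
        return [Content.from_text(total)]
```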

## Highly Available and Fault Tolerant
Most LLM data frameworks for unstructured data are optimized primarily for prototyping. A typical MVP data processing pipeline looks like this:
```python
data = load_data(source)                                 # I/O: can fail
embedding = generate_embedding(data)                     # model call: can fail
structured_data = structured_extraction_function(data)   # model call: can fail
db.save(embedding)                                       # write: can fail
db.save(structured_data)                                 # write: can fail, leaving partial state
```
Every line in the snippet above **can and will** [fail in production](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). If your application depends on its data pipeline never losing data, a framework designed for prototyping MVPs will inevitably lose data in production.

Indexify is distributed across many machines so that each stage of the pipeline scales out. Pipeline state is replicated across machines to recover from hardware failures and server crashes. You get predictable latency and throughput for data extraction, and the system is fully observable to help you troubleshoot.
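
For contrast, here is an illustrative sketch (plain Python, no Indexify APIs) of the machinery the hand-rolled pipeline above needs at every single stage: retries, backoff, and durable checkpoints, which is the kind of bookkeeping Indexify's replicated pipeline state handles for you:

```python
import json
import time


def run_stage(stage_fn, payload, checkpoint_path, retries=3):
    """Run one pipeline stage with retries, backoff, and a durable checkpoint."""
    for attempt in range(retries):
        try:
            result = stage_fn(payload)
            # Record progress durably so a crashed process can skip this stage.
            with open(checkpoint_path, "w") as f:
                json.dump({"done": True}, f)
            return result
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f"stage failed after {retries} attempts")
```

Each of the five lines in the MVP snippet would need this wrapper, and you would still have to handle replay after a process restart and clean up partially written results.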


## Local Experience
Indexify runs locally without **any** dependencies. Pipelines developed and tested on a laptop run unchanged in production.

You should use Indexify if -

1. You are working with a non-trivial amount of data: thousands of documents, audio files, videos, or images.
2. Your data volume grows over time, and LLMs need access to updated data as quickly as possible.
3. You care about the reliability and availability of your ingestion pipelines.
4. You are working with multi-modal data, or you combine multiple models into a single extraction pipeline.
5. The user experience of your application degrades when your LLM reads stale data after the underlying sources change.

## Start Using Indexify

Dive into [Getting Started](getting_started.md) to learn how to use Indexify.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
nav:
  - Key Concepts: 'concepts.md'
  - Architecture: 'architecture.md'
  - Comparisons: 'comparisons.md'
  - Examples: 'examples_index.md'
  - CLI and UI:
      - User Interface: 'ui.md'
      - Extractor CLI: 'extractor_cli.md'
8 changes: 4 additions & 4 deletions examples/pdf/image/README.md
Before we begin, ensure you have the following:

1. First, run the [`image_pipeline.py`](image_pipeline.py) script to set up the extraction graph:
```bash
python image_pipeline.py
```

2. Then, run the [`upload_and_retrieve.py`](upload_and_retrieve.py) script to process a PDF and extract images:
```bash
python upload_and_retrieve.py
```

This script will:

You can customize the image extraction process by modifying the `extraction_graph_spec` in `image_pipeline.py`. For example, you could add additional extraction steps or change the output format.
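
As an illustrative sketch of such a customization, adding a second step after image extraction might look like this; the graph name and extractor module names are placeholders, and only the `pdf_to_image` policy name comes from the actual script:

```yaml
name: 'image-extraction-pipeline'           # placeholder graph name
extraction_policies:
  - extractor: 'yourorg/pdf-to-image'       # placeholder extractor name
    name: 'pdf_to_image'
  - extractor: 'yourorg/image-captioner'    # placeholder added step
    name: 'caption_images'
    content_source: 'pdf_to_image'
```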

In `upload_and_retreive.py`, you can modify the `pdf_url` variable to process different PDF documents.
In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.

## Conclusion

"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
def get_images(pdf_path):
        policy_name="pdf_to_image"
    )

    return images

# Example usage
if __name__ == "__main__":

    # Get images from the PDF
    images = get_images(pdf_path)
    for image in images:
        content_id = image["id"]
        with open(f"{content_id}.png", 'wb') as f:
            print("writing image ", content_id)
            f.write(image["content"])
