update docs

diptanu committed Jul 4, 2024
1 parent 9300fe2 commit 1f6b3a4

Showing 5 changed files with 57 additions and 26 deletions.
62 changes: 44 additions & 18 deletions docs/docs/index.md

Indexify is a data framework designed for building ingestion and extraction pipelines for unstructured data. These pipelines are defined using declarative configuration. Each stage of the pipeline can perform structured extraction using any AI model or transform ingested data. The pipelines start working immediately upon data ingestion into Indexify, making them ideal for interactive applications and low-latency use cases.

## How It Works

##### Setup Ingestion Pipelines

Indexify provides a declarative configuration approach. You define an ingestion pipeline like this:
```yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
  # ... (remaining policies are collapsed in this diff view)
```

3. We provide a set of ready-to-use extractors, and you can easily write custom extractors to wrap any data extraction or transformation code or library; a sketch of a complete graph follows below.
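
The rest of the graph is collapsed in this diff view. For orientation, a complete graph might look like the sketch below, assuming the `content_source` field is what chains one policy's output into the next. The policy names (`pdf_to_markdown`, `entity_extractor`, `embedding`) match the retrieval examples later on this page; the extractor module names are placeholders, not real extractors:

```yaml
name: 'pdf-ingestion-pipeline'
extraction_policies:
  - extractor: 'yourorg/pdf-to-markdown'    # placeholder extractor name
    name: 'pdf_to_markdown'
  - extractor: 'yourorg/entity-extractor'   # placeholder extractor name
    name: 'entity_extractor'
    content_source: 'pdf_to_markdown'
  - extractor: 'yourorg/embeddings'         # placeholder extractor name
    name: 'embedding'
    content_source: 'pdf_to_markdown'
```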

##### Upload Data
```python
from indexify import IndexifyClient

client = IndexifyClient()
content_id = client.upload_file("pdf-ingestion-pipeline", "file.pdf")
```

##### Retrieve
Retrieve the data extracted by each extraction policy for the uploaded document.
```python
markdown = client.get_extracted_content(content_id, "pdf-ingestion-pipeline", "pdf_to_markdown")
named_entities = client.get_extracted_content(content_id, "pdf-ingestion-pipeline", "entity_extractor")
```

Embeddings are automatically written into the configured vector database (default: LanceDB). You can search the indexes like this:
```python
results = client.search("pdf-ingestion-pipeline.embedding.embedding", "Who won the 2017 NBA finals?", k=3)
```

## Multi-Modal
Indexify can parse PDFs, videos, images, and audio. You can use any model under the sun to extract data in the pipelines. We have written several extractors ourselves and continue to add more. You can write a new extractor that wraps any local model or API in under five minutes, as in the sketch below.
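
As a rough illustration, here is a minimal custom extractor. The `indexify_extractor_sdk` module, the `Extractor`/`Content` classes, and the `extract` signature are assumptions about the SDK interface, and the extractor itself is hypothetical; consult the extractor SDK docs for the exact API:

```python
from typing import List

# Assumed SDK imports; verify against the installed indexify_extractor_sdk.
from indexify_extractor_sdk import Content, Extractor


def call_your_model_or_api(text: str) -> str:
    # Placeholder for any local model or hosted API call.
    return text.split("Total:")[-1].strip()


class InvoiceTotalExtractor(Extractor):
    name = "yourorg/invoice-total"            # hypothetical extractor name
    description = "Extracts the total amount from invoice text"
    input_mime_types = ["text/plain"]

    def extract(self, content: Content, params: dict = None) -> List[Content]:
        text = content.data.decode("utf-8")
        total = call_your_model_or_api(text)
        return [Content.from_text(total)]
```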

## Highly Available and Fault Tolerant
Most LLM data frameworks for unstructured data are optimized primarily for prototyping. A typical MVP data processing pipeline looks like this:
```python
data = load_data(source)                                 # I/O: can fail
embedding = generate_embedding(data)                     # model call: can fail
structured_data = structured_extraction_function(data)   # model call: can fail
db.save(embedding)                                       # write: can fail
db.save(structured_data)                                 # write: can fail, leaving partial state
```
Every line in the snippet above **can and will** [fail in production](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). If your application depends on its data pipeline never losing data, a framework designed for prototyping MVPs will inevitably lose data in production.

Indexify is distributed across many machines so that each stage of the pipeline scales out. Pipeline state is replicated across machines to recover from hardware failures and server crashes. You get predictable latency and throughput for data extraction, and the system is fully observable to help you troubleshoot.
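
For contrast, here is an illustrative sketch (plain Python, no Indexify APIs) of the machinery the hand-rolled pipeline above needs at every single stage: retries, backoff, and durable checkpoints, which is the kind of bookkeeping Indexify's replicated pipeline state handles for you:

```python
import json
import time


def run_stage(stage_fn, payload, checkpoint_path, retries=3):
    """Run one pipeline stage with retries, backoff, and a durable checkpoint."""
    for attempt in range(retries):
        try:
            result = stage_fn(payload)
            # Record progress durably so a crashed process can skip this stage.
            with open(checkpoint_path, "w") as f:
                json.dump({"done": True}, f)
            return result
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError(f"stage failed after {retries} attempts")
```

Each of the five lines in the MVP snippet would need this wrapper, and you would still have to handle replay after a process restart and clean up partially written results.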


## Local Experience
Indexify runs locally without **any** dependencies. Pipelines developed and tested on a laptop run unchanged in production.

You should use Indexify if -

1. You are working with a non-trivial amount of data: thousands of documents, audio files, videos, or images.
2. Your data volume grows over time, and LLMs need access to updated data as quickly as possible.
3. You care about the reliability and availability of your ingestion pipelines.
4. You are working with multi-modal data, or you combine multiple models into a single extraction pipeline.
5. The user experience of your application degrades when your LLM reads stale data after the underlying sources change.

## Start Using Indexify

Dive into [Getting Started](getting_started.md) to learn how to use Indexify.
1 change: 1 addition & 0 deletions docs/mkdocs.yml
nav:
  - Key Concepts: 'concepts.md'
  - Architecture: 'architecture.md'
  - Comparisons: 'comparisons.md'
  - Examples: 'examples_index.md'
  - CLI and UI:
      - User Interface: 'ui.md'
      - Extractor CLI: 'extractor_cli.md'
8 changes: 4 additions & 4 deletions examples/pdf/image/README.md
Before we begin, ensure you have the following:

1. First, run the [`image_pipeline.py`](image_pipeline.py) script to set up the extraction graph:
```bash
python image_pipeline.py
```

2. Then, run the [`upload_and_retrieve.py`](upload_and_retrieve.py) script to process a PDF and extract images:
```bash
python upload_and_retrieve.py
```

This script will:

You can customize the image extraction process by modifying the `extraction_graph_spec` in `image_pipeline.py`. For example, you could add additional extraction steps or change the output format.
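
As an illustrative sketch of such a customization, adding a second step after image extraction might look like this; the graph name and extractor module names are placeholders, and only the `pdf_to_image` policy name comes from the actual script:

```yaml
name: 'image-extraction-pipeline'           # placeholder graph name
extraction_policies:
  - extractor: 'yourorg/pdf-to-image'       # placeholder extractor name
    name: 'pdf_to_image'
  - extractor: 'yourorg/image-captioner'    # placeholder added step
    name: 'caption_images'
    content_source: 'pdf_to_image'
```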

In `upload_and_retreive.py`, you can modify the `pdf_url` variable to process different PDF documents.
In `upload_and_retrieve.py`, you can modify the `pdf_url` variable to process different PDF documents.

## Conclusion

"""

extraction_graph = ExtractionGraph.from_yaml(extraction_graph_spec)
client.create_extraction_graph(extraction_graph)
def get_images(pdf_path):
        policy_name="pdf_to_image"
    )

    return images

# Example usage
if __name__ == "__main__":

    # Get images from the PDF
    images = get_images(pdf_path)
    for image in images:
        content_id = image["id"]
        with open(f"{content_id}.png", 'wb') as f:
            print("writing image ", content_id)
            f.write(image["content"])
