A multimodal, multi-document RAG application. Handles PDF, DOCX, XLSX, CSV, and image formats using gemini-flash.

ankitsrivastava637/multi-document-MMRAG-gemini

Multi-document-MMRAG-gemini Application

A multimodal, multi-document RAG (retrieval-augmented generation) application for the real-estate sector. It handles PDF, DOCX, XLSX, CSV, and image formats using gemini-flash.

The main app uses:

  1. Streamlit
  2. gemini-flash for conversation and the Gemini embeddings model, both through the Gemini API
  3. FAISS for the vector store
  4. LangChain
  5. tesseract-ocr and other libraries listed in requirements.txt
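
To make the retrieval flow behind this stack concrete, here is a minimal, standard-library-only sketch of the "embed, search, build prompt" loop. It is not the code in app.py: a toy bag-of-words vector stands in for the Gemini embeddings model, a brute-force cosine search stands in for FAISS, and the function names (embed, retrieve, build_prompt) and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (brute-force search)."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Assemble the text that would be sent to the chat model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Hypothetical document chunks, as they might look after text extraction.
chunks = [
    "The lease for unit 4B starts on 1 March and runs for 12 months.",
    "Monthly rent for unit 4B is 1500 dollars.",
    "The building has a shared parking garage.",
]
question = "What is the monthly rent for unit 4B?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, top))
```

In the real application the vectors would come from the embeddings model and the search from the vector store; the overall shape of the loop is what this sketch is meant to show.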

Getting Started

Follow these steps to clone and run the application.

Prerequisites

  • Python 3.6 or higher
  • Git installed on your machine
  • A Google Gemini API key

Installation

  1. Clone the Repository

    Open your terminal and run the following command:

    git clone https://github.com/ankitsrivastava637/multi-document-MMRAG-gemini.git
  2. Navigate to the Repository Directory

    Change your current directory to the cloned repository:

    cd multi-document-MMRAG-gemini
  3. Set Up a Virtual Environment (Optional but Recommended)

    Create a virtual environment to manage dependencies:

    python -m venv venv

    Activate the virtual environment:

    • On Windows:

      .\venv\Scripts\activate
    • On macOS and Linux:

      source venv/bin/activate
  4. Install Dependencies

    Install the required dependencies using pip:

    pip install -r requirements.txt
  5. Set Your Google Gemini API Key

    Make your Gemini API key available to the application before starting it.

  6. Run the Streamlit Application

    Start the Streamlit application:

    streamlit run app.py

    This command will launch the Streamlit application in your default web browser. If the main script has a different name, replace app.py with the appropriate file name.
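
Step 5 does not say how the application reads the key. A common pattern is an environment variable; the sketch below assumes the variable is named GOOGLE_API_KEY, which is an assumption rather than something this README confirms, so check app.py for the name it actually expects. The helper name get_gemini_api_key is hypothetical.

```python
import os


def get_gemini_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing.

    The default variable name is an assumption; adjust it to match app.py.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell or add it to a .env file."
        )
    return key
```

Failing at startup with a clear message tends to be easier to debug than an authentication error raised later, in the middle of a chat request.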

Access the Application

Once the application is running, you can access it by opening the URL provided in the terminal (typically http://localhost:8501) in your web browser.

Additional Tips

  • If there are environment variables required by the application, make sure to set them up in your environment or create a .env file in the project directory.
  • Check the repository's README file for any additional setup instructions or configuration options specific to the application.
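
As an illustration of the .env tip, here is a small standard-library sketch that reads KEY=VALUE lines into the process environment. The function name load_env_file is hypothetical. The python-dotenv package offers a more robust load_dotenv() for the same purpose; whether it is listed in requirements.txt is not stated here.

```python
import os


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines and copy them into os.environ.

    Values already present in the environment are not overridden.
    Blank lines, comments (#) and lines without '=' are skipped.
    """
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, _, value = line.partition("=")
            name = name.strip()
            value = value.strip().strip("'\"")  # drop optional surrounding quotes
            loaded[name] = value
            os.environ.setdefault(name, value)
    return loaded
```

Call load_env_file() once, before the application reads its settings. Keep the .env file out of version control, since it holds the API key.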

By following these steps, you should be able to clone and run the Streamlit application successfully.

Improvements to consider in the future:

  • Better feature extraction from semi-structured data in PDF, DOCX, image, XLSX, and CSV files.
  • For documents, generating image descriptions with an LLM and storing their vector embeddings would enrich the data available for retrieval. Tabular data in documents can be further analyzed with an LLM before storing. A graph knowledge base and GraphRAG (https://microsoft.github.io/graphrag/) could help with complex queries that require understanding entities and their relationships.
  • For structured data (XLSX and CSV): if the column categories are generally consistent across all real-estate client data, a custom data-analysis approach (possibly agentic) can be considered at retrieval time. In that case, a query analyzer / query router (possibly LLM-based) can classify the type of query: 1. factual, 2. statistical, 3. relationship-entity.
  • Better data cleaning, transformation, and storage strategy for structured data.
  • Better domain adaptation to real estate for a specific client. More data for custom fine-tuning and, if possible, synthetic data generation for fine-tuning an LLM for feature extraction can be considered.
  • The design also needs modification so that a client can only get answers based on their own private data, not other clients' data. Generating a UUID for each unique user ID and using it in the prompt and in the code to restrict Q&A to that client's data can help here.
  • Better tools such as unstructured.io can be used for the document ETL pipeline.
  • Develop the data-ingestion interface for the ETL pipeline as a separate interface, decoupled from the chat interface.
  • Soft prompt tuning can be used for better prompt engineering. Newer frameworks such as DSPy (https://dspy-docs.vercel.app/) can also be considered for advanced RAG techniques.
  • Ensure security with chatbot guardrail tools such as NeMo Guardrails: https://blogs.nvidia.com/blog/ai-chatbot-guardrails-nemo/
  • Use a better framework for the backend (such as FastAPI) and the frontend (such as React).
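
Two of the ideas above, the query router and per-client data isolation, can be sketched in a few lines. This is only an illustration under stated assumptions: the router uses keyword heuristics where the list suggests an LLM might be used, and the names route_query, Chunk, new_client_id and chunks_for_client are hypothetical, not part of the current code.

```python
import re
import uuid
from dataclasses import dataclass
from typing import List

# Keyword heuristics; an LLM-based classifier could replace these patterns.
STATISTICAL = re.compile(
    r"\b(average|mean|median|sum|total|count|how many|highest|lowest|percent)\b"
)
RELATIONSHIP = re.compile(
    r"\b(owner of|owned by|related to|relationship|connected to|between)\b"
)


def route_query(query: str) -> str:
    """Classify a query as 'statistical', 'relationship' or 'factual'."""
    text = query.lower()
    if STATISTICAL.search(text):
        return "statistical"
    if RELATIONSHIP.search(text):
        return "relationship"
    return "factual"


@dataclass
class Chunk:
    client_id: str  # UUID string identifying the owning client
    text: str


def new_client_id() -> str:
    """Generate a unique identifier for a client."""
    return str(uuid.uuid4())


def chunks_for_client(chunks: List[Chunk], client_id: str) -> List[Chunk]:
    """Keep only chunks owned by this client, before any similarity search runs."""
    return [c for c in chunks if c.client_id == client_id]


alice, bob = new_client_id(), new_client_id()
store = [Chunk(alice, "Unit 4B lease terms"), Chunk(bob, "Unit 9A sale deed")]
print(route_query("What is the average rent across all units?"))
print([c.text for c in chunks_for_client(store, alice)])
```

Filtering by client ID before retrieval, rather than relying only on the prompt, means another client's text never reaches the model in the first place.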

Some useful Resources :

SYSTEM ARCHITECTURAL FLOW :

[Image: system architecture flow diagram (diagram-export-8-4-2024-1_17_22-PM)]

ROUGH WORK

Some initial experimentation and rough work with Hugging Face open-source models in Google Colab can be found here: https://colab.research.google.com/drive/1IWITZ6Ye-CIfMLMM96V0OhPhJpRtiFqp?usp=sharing

  • This Colab was intended to design a more complex and customized RAG system with multi-vector retrieval, handling images and tables in PDFs (images through LLM-generated descriptions), plus further steps for better embedding and retrieval.
  • The idea in the Colab rough work was to experiment with manual multi-document ingestion and running an ETL pipeline over semi-structured and structured data from multiple complex documents and datasets, for retrieval.
  • The datasets/documents need more preprocessing, and deeper, dynamic feature-extraction steps need to be implemented for better retrieval.
  • The vector store and retrieval approach alone do not help. The quality of the processed data and the extracted features also greatly affects the accuracy, quality, and variety of retrieval.
  • Retrieval will only be as good as the processed data and the extracted features.
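
The multi-vector idea mentioned above can be sketched as follows: search over short summaries (for example, LLM-written descriptions of tables or images), but return the full parent element. This is a simplified illustration and not the Colab's code; token-overlap (Jaccard) scoring stands in for embedding similarity, and the MultiVectorIndex class and sample data are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased word set used for the toy similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of tokens the two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class MultiVectorIndex:
    """Search over short summaries, but return the full parent element."""

    def __init__(self) -> None:
        self.parents: Dict[str, str] = {}                 # parent_id -> raw content
        self.summaries: List[Tuple[str, Set[str]]] = []   # (parent_id, summary tokens)

    def add(self, parent_id: str, raw_content: str, summaries: List[str]) -> None:
        """Store the raw element once and index each of its summaries."""
        self.parents[parent_id] = raw_content
        for summary in summaries:
            self.summaries.append((parent_id, tokens(summary)))

    def search(self, query: str, k: int = 1) -> List[str]:
        """Rank summaries against the query and return up to k distinct parents."""
        q = tokens(query)
        ranked = sorted(self.summaries, key=lambda item: jaccard(q, item[1]), reverse=True)
        seen, results = set(), []
        for parent_id, _ in ranked:
            if parent_id in seen:
                continue
            seen.add(parent_id)
            results.append(self.parents[parent_id])
            if len(results) == k:
                break
        return results


index = MultiVectorIndex()
index.add("table-1", "unit,rent\n4B,1500\n9A,2100", ["Table of monthly rent per unit"])
index.add("image-1", "<raw image bytes or path>", ["Photo of the building entrance and parking garage"])
print(index.search("parking garage photo"))
```

Because the summary is what gets matched, the quality of the generated descriptions directly bounds retrieval quality, which is the point the last two bullets make.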

Tests Screenshots :

[Images: four test screenshots]

FUNCTION flowchart :

[Image: function flowchart (diagram-export-8-4-2024-12_56_04-PM)]
