Topics from PDF

This repository contains the source code for the web application Topics from PDF. The application lets users upload a PDF document and, regardless of its size, extract its main topics. Responses are available in English and Spanish. It uses the model served by HuggingChat as its language model, accessed through an unofficial HuggingChat Python API; further details are provided below.

Requirements

Python version: 3.10
Libraries: langchain, pypdf, nltk, gensim, gradio, hugchat.
Installation: requirements.txt

Usage

Web App

The simplest way to try the app is to follow the link:

https://topicspdf.dsapp.me

It was deployed with Docker on Cloud Run, Google's serverless service.

Local installation

To run the web app on a local machine running Linux, install Python 3.10 and git. Then run the following in a terminal:

export HF_EMAIL_1=your_HuggingChat_email
export HF_PW_1=your_HuggingChat_password

git clone https://github.com/a-jimenezc/topics_from_pdf.git
cd topics_from_pdf
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py

HF_EMAIL_1 and HF_PW_1 are the credentials for a HuggingChat account, so you need to register an account first.
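If the app fails at startup, a missing credential is a likely cause. The following is a minimal sketch of how the two variables could be read and checked in Python. The helper name load_credentials is hypothetical and is not taken from the repository; only the variable names HF_EMAIL_1 and HF_PW_1 come from the instructions above.

```python
import os


def load_credentials(env=os.environ):
    """Return (email, password) read from HF_EMAIL_1 and HF_PW_1.

    Hypothetical helper for illustration; the app itself may read
    these variables differently.
    """
    required = ("HF_EMAIL_1", "HF_PW_1")
    # Treat unset and empty variables the same way.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["HF_EMAIL_1"], env["HF_PW_1"]


# Usage with a stand-in mapping instead of the real environment.
email, password = load_credentials(
    {"HF_EMAIL_1": "user@example.com", "HF_PW_1": "placeholder"}
)
```

Failing early with a message that names the missing variable is usually easier to debug than a login error raised later by the chat client.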

Project Description

The main goal of this project is to enable the extraction of the main topics and ideas from a PDF document, regardless of its size and without incurring any additional cost. This was achieved with the help of open-source libraries and resources:

Summarization with LDA

LLMs can summarize documents, but for large documents the computational cost makes this prohibitively expensive, so some preprocessing is needed. LDA (Latent Dirichlet Allocation) is a topic-modeling algorithm well suited to this step: it produces a list of words per topic and lets you choose both the number of topics and the number of words per topic. The resulting word lists are then fed into an LLM, which is prompted to articulate a description of each topic in natural language. This approach aids in extracting the core ideas from the document. In this case, the "summary" is given in a table-of-contents format.
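The hand-off from LDA to the LLM can be sketched as follows. This is a minimal illustration, assuming the topics have already been extracted as plain lists of words (for example with gensim); the function name build_topic_prompt, the example words, and the prompt wording are all hypothetical and are not taken from the repository.

```python
def build_topic_prompt(topics, language="English"):
    """Format LDA word lists as one prompt asking for topic descriptions.

    `topics` is a list of word lists, one list per topic.
    Hypothetical helper; the prompt text is an assumption.
    """
    if not topics:
        raise ValueError("at least one topic is required")
    # One numbered line per topic, listing its key words.
    lines = [
        f"Topic {index}: {', '.join(words)}"
        for index, words in enumerate(topics, start=1)
    ]
    header = (
        "Each line below lists the key words of one topic found in a document. "
        f"Describe every topic in one short sentence, in {language}, "
        "formatted as a table of contents."
    )
    return header + "\n\n" + "\n".join(lines)


# Made-up word lists standing in for real LDA output.
example_topics = [
    ["energy", "solar", "grid", "storage"],
    ["policy", "tax", "subsidy", "regulation"],
]
prompt = build_topic_prompt(example_topics, language="Spanish")
print(prompt)
```

Because the prompt contains only a few words per topic, its length depends on the number of topics and words chosen, not on the size of the PDF.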

HugChat, Unofficial Hugging Chat API

The hugchat library by Soulter offers an unofficial API for HuggingChat. At the time of writing, the model powering HuggingChat is mistralai/Mixtral-8x7B-Instruct-v0.1, but this can change over time. The terms of use, limitations, caveats, and licensing stipulated by HuggingChat and by the model it serves therefore apply when using this web application. Please visit the official documentation for more information.

Gradio

The application was built with Gradio, which offers back-end and front-end support for machine learning applications. It also has excellent support for language models.

LangChain, the orchestration library

LangChain made this project possible. It offers a rich set of tools for working with LLMs, including prompt templates, vector databases, and more.

Licence

GNU General Public License v3.0

Disclaimer

This application relies on third-party libraries and resources. Consequently, its use is subject to the terms of use, conditions, and licenses of those external libraries and resources.

Author

Antonio Jimenez Caballero
