ML Text Summarization project
This project builds and deploys:
- A summarization model, obtained by transfer learning from a pretrained T5 model and fine-tuned on the BBC News Summary dataset.
- A backend application to initialize and serve the summarization model.
- A React web application that takes input text and returns the summarized text. The application also reports evaluation metrics when a reference summary is provided.
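The README does not say which evaluation metrics the application reports. As an illustration only, the sketch below computes ROUGE-1 (unigram overlap), a common summarization metric, in plain Python. The function names `tokenize` and `rouge1` are hypothetical and are not taken from this repository.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Return ROUGE-1 precision, recall and F1 for a candidate summary.

    Assumption: simple regex tokenization, no stemming or stopword removal.
    """
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat", "the cat sat on the mat")
print(scores["precision"], scores["recall"])  # 1.0 0.5
```

A score like this is only meaningful when a reference summary is available, which matches the behavior described for the web application.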
The following instructions should work on Linux, Windows and macOS. If you are a Windows user familiar with Linux, consider the Windows Subsystem for Linux, Version 2 (WSL2), which lets you run a Linux system on a Windows machine. Using native Windows should also work.
It is helpful to install git on your machine, but you can also download the full repository from GitHub as a zip file. If you use git, run the following commands from the command line:
git clone https://github.com/furyhawk/text_summarization.git
cd text_summarization
Skip to Backend and Frontend setup if you do not need to run the notebooks.
For local setup, we recommend using Miniconda, a minimal version of the popular Anaconda distribution that contains only the package manager conda and Python. Follow the installation instructions on the Miniconda homepage.
After installation of Anaconda/Miniconda, run the following command(s) from the project directory:
conda env create --name text --file text.yml
conda activate text
On Windows, download git-lfs from https://git-lfs.github.com/ and install it.
git lfs install
Now you can start the Jupyter lab server:
jupyter lab
If you are working on WSL under Windows, add --no-browser.
If you need to fine-tune your own model, sign up for free at https://huggingface.co/ and log in to Hugging Face as needed:
huggingface-cli login
docker-compose -f docker-compose.yml up -d
Test the frontend at http://localhost:3000/ and the backend at http://localhost:8000/docs
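To check both URLs from a script instead of a browser, a minimal sketch using only the Python standard library is shown below. The helper name `is_up` is hypothetical. It only tests that a GET request returns a 2xx status and makes no assumption about the backend's API routes.

```python
import http.client
import urllib.error
import urllib.request


def is_up(url, timeout=3.0):
    """Return True if an HTTP GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


if __name__ == "__main__":
    checks = [
        ("frontend", "http://localhost:3000/"),
        ("backend docs", "http://localhost:8000/docs"),
    ]
    for name, url in checks:
        print(f"{name}: {'up' if is_up(url) else 'not reachable'}")
```

The containers may need some time after startup before they respond, so a "not reachable" result right after `docker-compose up` is not necessarily an error.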
Note that the Transformers library will download up to 2 GB of model files.
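The download is triggered the first time a model is loaded. The sketch below shows roughly what loading and running a T5 summarizer with Hugging Face Transformers looks like. It is not the code from this repository: the checkpoint name `t5-small` and the helper names `build_t5_input` and `summarize` are assumptions chosen for illustration, and the function needs `transformers`, `torch` and `sentencepiece` installed.

```python
def build_t5_input(text, prefix="summarize: "):
    """Collapse whitespace and prepend the task prefix used for T5 summarization."""
    return prefix + " ".join(text.split())


def summarize(text, model_name="t5-small", max_new_tokens=150):
    """Summarize `text` with a T5 checkpoint (assumed name: t5-small)."""
    # Imported lazily so build_t5_input works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # from_pretrained downloads and caches the files on first use.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(
        build_t5_input(text),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print(build_t5_input("  BBC   news\narticle text "))  # summarize: BBC news article text
```

Calling `summarize("...")` once warms the local cache, so later runs start without downloading again.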
You can skip the container setup below.
From the /backend directory:
cd backend
Then run:
uvicorn app.main:app --host 0.0.0.0 --port 8000
or
python ./app/main.py
cd frontend/textsum
npm install
npm start
This opens a new browser tab with the Summarization App in the development environment. On later runs, 'npm start' alone is enough.
To push an image to Docker Hub, you must first name your local image using your Docker Hub username and the repository name that you created through Docker Hub on the web.
You can add multiple images to a repository by adding a specific :<tag> to them (for example docs/base:testing). If it’s not specified, the tag defaults to latest.
docker build . -t <hub-user>/textsum_endpoint:latest
docker push <hub-user>/textsum_endpoint:latest
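The tag-defaulting rule described above can be sketched in a few lines of Python. The helper `split_image_ref` is hypothetical and simplified: it handles `name[:tag]` references, including a registry host with a port, but does not handle digests (`@sha256:...`).

```python
def split_image_ref(ref):
    """Split 'name[:tag]' into (name, tag); the tag defaults to 'latest'."""
    last_slash = ref.rfind("/")
    colon = ref.rfind(":")
    # A colon only marks a tag if it comes after the last slash; otherwise
    # it belongs to a registry host such as localhost:5000.
    if colon > last_slash:
        return ref[:colon], ref[colon + 1:]
    return ref, "latest"


print(split_image_ref("docs/base:testing"))  # ('docs/base', 'testing')
print(split_image_ref("docs/base"))          # ('docs/base', 'latest')
```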
This dataset was created from a dataset used for data categorization that consists of 2225 documents from the BBC news website, corresponding to stories in five topical areas from 2004-2005. The original dataset was used in the paper by D. Greene and P. Cunningham, "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006. All rights, including copyright, in the content of the original articles are owned by the BBC. More at http://mlg.ucd.ie/datasets/bbc.html
https://github.com/huggingface/notebooks/blob/master/examples/summarization.ipynb
https://github.com/blueprints-for-text-analytics-python/blueprints-text