This module implements the data structure for Sparrow ML model fine-tuning. A list of invoices is used to build a Hugging Face dataset.
- Install
pip install -r requirements.txt
- Install Poppler, required for pdf2image to work (macOS example)
brew install poppler
- Install Mindee docTR OCR with its dependencies
pip install torch torchvision torchaudio
pip install python-doctr
- Run OCR on invoices with PDF conversion to JPG
python run_ocr.py
- Run data conversion to Sparrow format
python run_converter.py
- Run Sparrow UI to annotate the documents and create key/value pairs.
- Run the data preparation task for Donut model fine-tuning. This task creates the metadata and a Hugging Face dataset with train, validation and test splits.
python run_donut.py
- Push the dataset to the Hugging Face Hub. You need a Hugging Face account and a Hugging Face Hub token. Read more: https://huggingface.co/docs/datasets/main/en/image_dataset
python run_donut_upload.py
- Test the dataset by using load_dataset to fetch data from the Hugging Face Hub
python run_donut_test.py
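The data preparation step above (metadata plus train, validation and test splits) can be sketched roughly as follows. This is a minimal illustration, not the code in run_donut.py: the 80/10/10 ratio, the seed, the function names and the invoice_no field are all assumptions. The metadata.jsonl layout with a file_name column follows the Hugging Face image dataset convention, and the gt_parse wrapper follows the convention commonly used for Donut ground truth.

```python
import json
import random


def split_files(file_names, train=0.8, validation=0.1, seed=42):
    """Shuffle file names deterministically and cut them into three splits."""
    names = sorted(file_names)
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train)
    n_val = int(len(names) * validation)
    return {
        "train": names[:n_train],
        "validation": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # whatever is left over
    }


def metadata_line(file_name, fields):
    """Build one metadata.jsonl line; the ground truth is stored as a JSON string."""
    ground_truth = json.dumps({"gt_parse": fields})
    return json.dumps({"file_name": file_name, "ground_truth": ground_truth})


if __name__ == "__main__":
    # Hypothetical file names and field, for illustration only.
    files = [f"invoice_{i}.jpg" for i in range(10)]
    for split_name, members in split_files(files).items():
        print(split_name, len(members))
    print(metadata_line("invoice_0.jpg", {"invoice_no": "12345"}))
```

Fixing the seed keeps the splits reproducible between runs, which matters when the dataset is regenerated after new annotations are added.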
Set huggingface_key in config.py
- Run
cd api
uvicorn endpoints:app --workers 4
- FastAPI Swagger
http://127.0.0.1:8000/api/v1/sparrow-data/docs
Run in Docker container
- Build Docker image
docker build --tag katanaml/sparrow-data .
- Run Docker container
docker run -it --name sparrow-data -p 7860:7860 katanaml/sparrow-data:latest
- Info
curl -X 'GET' \
'http://127.0.0.1:8000/api-dataset/v1/sparrow-data/dataset_info' \
-H 'accept: application/json'
Replace URL with your own
- Ground truth
curl -X 'GET' \
'http://127.0.0.1:8000/api-dataset/v1/sparrow-data/ground_truth' \
-H 'accept: application/json'
- OCR service
curl -X 'POST' \
'http://127.0.0.1:8000/api-ocr/v1/sparrow-data/ocr' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@inout-20211211_001.jpg;type=image/jpeg' \
-F 'image_url=' \
-F 'sparrow_key=your_key'
Replace URL with your own
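The same endpoints can also be called from Python. The sketch below uses only the standard library and mirrors the curl commands above. The helper names get_json and build_ocr_request are hypothetical, the base URL is the local example and should be replaced with your own, and the response format is not assumed beyond being JSON.

```python
import json
import urllib.request
import uuid

BASE_URL = "http://127.0.0.1:8000"  # replace with your own URL


def get_json(path, base_url=BASE_URL):
    """Send a GET request and decode the JSON response."""
    request = urllib.request.Request(
        base_url + path, headers={"accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def build_ocr_request(image_bytes, file_name, sparrow_key, base_url=BASE_URL):
    """Build (but do not send) the multipart POST for the OCR endpoint."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields, same as the -F 'image_url=' and -F 'sparrow_key=...' flags.
    for name, value in (("image_url", ""), ("sparrow_key", sparrow_key)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The uploaded image, same as -F 'file=@...;type=image/jpeg'.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: image/jpeg\r\n\r\n".encode()
        + image_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url + "/api-ocr/v1/sparrow-data/ocr",
        data=b"".join(parts),
        headers={
            "accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


# Example calls (these need a running service, so they are left commented out):
# info = get_json("/api-dataset/v1/sparrow-data/dataset_info")
# truth = get_json("/api-dataset/v1/sparrow-data/ground_truth")
```

To send the OCR request, pass the object returned by build_ocr_request to urllib.request.urlopen and decode the response the same way get_json does.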
- Create new space - https://huggingface.co/spaces. Follow instructions from readme doc
- Create huggingface_key secret in space settings
- In config.py, replace huggingface_key variable with this line of code
huggingface_key: str = os.environ.get("huggingface_key")
- Commit and push the code to the space, following the readme instructions. The Docker container will be deployed automatically. Example:
https://huggingface.co/spaces/katanaml-org/sparrow-data
- The Sparrow Data API will be accessible by URL, which you can get from the space info. Example:
https://katanaml-org-sparrow-data.hf.space/api/v1/sparrow-data/docs
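Hugging Face Spaces exposes secrets to the container as environment variables, which is why the huggingface_key step above reads from os.environ. If the secret is missing, os.environ.get returns None without any error, so a small guard can make the failure visible at startup. The helper below is a hypothetical sketch and not part of config.py:

```python
import os


def read_huggingface_key(environ=os.environ):
    """Return the Hugging Face token from the environment or fail with a clear message."""
    key = environ.get("huggingface_key")
    if not key:
        raise RuntimeError(
            "huggingface_key is not set; add it as a secret in the space settings"
        )
    return key
```

Passing the environment in as a parameter makes the function easy to check locally, for example read_huggingface_key({"huggingface_key": "hf_example"}).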
Licensed under the Apache License, Version 2.0. Copyright 2020-2023 Katana ML, Andrej Baranovskij. Copy of the license.