GitHub - GodIwakuraLain/llm_aided_ocr: Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections.

Changes

Support for Third-Party OpenAI APIs: Now works with third-party OpenAI-compatible API services such as one-api and new-api. Set the base URL with OPENAI_BASE_URL in the .env file.
Asynchronous Processing Toggle: Asynchronous processing can now be enabled or disabled, which helps manage API rate limits. Controlled by the ASYNC_API_REQUESTS setting in the .env file.
Retry Mechanism: Implemented a retry mechanism for API requests, with a default of 3 retries and a 10-second delay between each attempt.
Temporary File Storage for pdf2image Conversion: pdf2image conversion results are now temporarily stored in a .temp_pdf_images folder within the current directory.
Skip Image Conversion if Already Exists: The script now skips the image conversion step if the corresponding images already exist in the .temp_pdf_images directory. Only the first image is checked to decide whether a previous conversion completed.
Skip Tesseract OCR if Output File Already Exists: Added logic to bypass Tesseract OCR processing if the raw output file (filename_raw_ocr_output.txt) already exists.
Moved Settings to .env: Configuration settings have been relocated to a .env file for easier management.
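
The retry and skip-if-exists behaviour listed above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names, the `page_1.png` file name, and the reading of "3 retries" as three further attempts after the first failure are all assumptions made for the example.

```python
import time
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_retries: int = 3,          # default taken from the change notes
    delay_seconds: float = 10.0,   # default taken from the change notes
) -> T:
    """Call `request`; on failure, wait and retry up to `max_retries` more times."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # Out of retries: let the last error reach the caller.
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")


def images_already_converted(image_dir: Path) -> bool:
    """Treat conversion as done if the first page image is present."""
    # Hypothetical file name; the real script may name page images differently.
    return (image_dir / "page_1.png").exists()


def ocr_already_done(raw_output_path: Path) -> bool:
    """Treat OCR as done if the raw output text file is present."""
    return raw_output_path.is_file()
```

A caller would wrap each API request, for example `call_with_retries(lambda: send_chunk(text))`, where `send_chunk` stands in for whatever function issues the request.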

OPENAI_BASE_URL="http://XXXX/v1" (OpenAI base URL. For services like one-api, new-api, etc., append /v1 to the URL)
OPENAI_MAX_TOKENS=8194 (Maximum token limit for OpenAI requests)
ASYNC_API_REQUESTS=False (Toggle for asynchronous API requests. Useful for managing rate limits)
INPUT_PDF_FILE_PATH=xxxx.pdf (Path to the PDF file you wish to process)
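
As a rough sketch of how these settings might be read once they are in the process environment (for instance after a loader such as python-dotenv has read the .env file), the following uses only the standard library. The `Settings` class, `parse_bool`, `load_settings`, and the fallback defaults are assumptions for illustration and are not taken from the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class Settings:
    base_url: Optional[str]
    max_tokens: int
    async_requests: bool
    input_pdf: Optional[str]


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        base_url=env.get("OPENAI_BASE_URL"),
        # Fallbacks mirror the example values shown above (an assumption).
        max_tokens=int(env.get("OPENAI_MAX_TOKENS", "8194")),
        async_requests=parse_bool(env.get("ASYNC_API_REQUESTS", "False")),
        input_pdf=env.get("INPUT_PDF_FILE_PATH"),
    )
```

Passing the mapping explicitly, as in `load_settings({"ASYNC_API_REQUESTS": "True"})`, keeps the parsing easy to check without touching the real environment.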

Repository files (55 commits):
docker
.gitignore
.python-version
160301289-Warren-Buffett-Katharine-Graham-Letter.pdf
160301289-Warren-Buffett-Katharine-Graham-Letter__raw_ocr_output.txt
160301289-Warren-Buffett-Katharine-Graham-Letter_llm_corrected.md
README.md
llm-aided-ocr-cli.py
llm_aided_ocr.py
requirements.txt