Simple Python Optical Character Recognition using PyTesseract, to extract date stamps from scanned photos
This is a hobby project to try and better organise a large batch of old family photos I scanned a while back. The desired end goal is for all of the scanned photos which have a red date stamp in the corner to have a filename containing that date.
As this is a non-trivial computer vision problem, this may be a total failure, but the idea is to try and narrow down the problem with as many reasonable assumptions as possible.
For example, hopefully:
- we only need to detect 10 characters - digits 0-9
- we only need to support the single 7-segment digit font used in those old image date stamps
- we can assume the date stamp was imprinted in a (hopefully fairly consistent) shade of red
If these prove to be trye, we should then be able to apply various image transforms and filters first to leave only the digit segments, binarize this and put it through tesseract with a carefully trained model just for these digits to do the actual OCR.
- Clone the repository
$ git clone [email protected]:beveradb/scanned-photos-date-stamp-ocr.git
- Check into the cloned repository
$ cd scanned-photos-date-stamp-ocr
- If you are using Pipenv, setup the virtual environment and start it as follows:
$ pipenv install && pipenv shell
- Install the requirements
$ pip install -r requirements.txt
- Run the script to process the example images
python ocr.py
This is a low commitment personal hobby project - feel free to fork and do whatever you like, contributions will be gratefully accepted and merged, but please don't expect much from me!