Project submission for AWS DeepLens Challenge
For this project, I wanted to build an application that could read books to children. To achieve this, I designed a workflow that performs the following steps:
- Determine when a page with text is in the camera frame
- Clean up the image using OpenCV
- Perform OCR (Optical Character Recognition)
- Transform text into audio using AWS Polly
- Play back the audio through speakers plugged into DeepLens
My dataset was made from hundreds of photos of my kids' books as well as a number of library books taken in various lighting conditions, orientations, and distances. I used labelImg to annotate my dataset with bounding boxes so I could train the model to identify Text Blocks on a page.
The model was trained with MXNet, using VGG-16 as the base network. The steps used for training are outlined in this notebook.
This project is built using AWS Greengrass, Python, MXNet, OpenCV, Tesseract, and AWS Polly.
To clean up the text area before OCR, the program needs to perform pre-processing on the image using a number of filters in OpenCV. This graphic shows an example of the steps that ReadToMe goes through with each image before trying to turn it into text.
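The exact filters live in imageProcessing.py; the snippet below is only a minimal sketch of the kind of cleanup involved (grayscale, blur, and adaptive threshold), not the project's exact pipeline.

```python
import cv2

def preprocess_for_ocr(image):
    # Convert to grayscale so thresholding works on a single channel
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Light blur to suppress sensor noise before thresholding
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    # Adaptive threshold handles uneven lighting across the page
    cleaned = cv2.adaptiveThreshold(
        blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 31, 15)
    return cleaned
```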
The Lambda function consists of three main files.
- readToMeLambda.py
- Contains main workflow for project (imports imageProcessing.py)
- imageProcessing.py
- Contains helper functions used for image and text cleanup
- speak.py
- Contains helper functions used to call AWS Polly and synthesize the audio
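For reference, a Polly call from Python with boto3 typically looks like the sketch below. This is not the exact code in speak.py; the voice and output path are assumptions.

```python
import boto3

polly = boto3.client('polly')

def synthesize(text, out_path='/tmp/speech.mp3'):
    # Request an MP3 stream from Polly for the recognized text
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat='mp3',
        VoiceId='Joanna')  # voice choice is an assumption
    with open(out_path, 'wb') as f:
        f.write(response['AudioStream'].read())
    return out_path
```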
Because the user has no way to tell the DeepLens when a book is in front of the camera, we use the model to detect blocks of text on the page. When we find a text block, we isolate that region of the image using the getRoi() function inside of imageProcessing.py.
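For orientation, cropping a detected bounding box out of the frame is essentially a NumPy slice like the one below; the box format shown here is an assumption, and the real logic is in getRoi().

```python
def get_roi(frame, box):
    # box is assumed to be (xmin, ymin, xmax, ymax) in pixel coordinates
    xmin, ymin, xmax, ymax = box
    # NumPy slicing: rows are y, columns are x
    return frame[ymin:ymax, xmin:xmax]
```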
Another important step is correctSkew() in imageProcessing.py, which warps/rotates the text block to make the text as close to horizontal as possible. If the text is angled or skewed, OCR results suffer.
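A common way to deskew a binarized text block with OpenCV is to measure the angle of the minimum-area rectangle around the text pixels and rotate by the opposite angle. The sketch below shows that general approach; it may differ in detail from correctSkew() in this repo.

```python
import cv2
import numpy as np

def correct_skew_sketch(binary_image):
    # Coordinates of all non-zero (text) pixels
    coords = np.column_stack(np.where(binary_image > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Map the rectangle angle to a small rotation
    # (the angle convention varies between OpenCV versions)
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle
    h, w = binary_image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary_image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```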
Finally, after performing OCR, we remove any non-UTF-8 characters using RemoveNonUtf8BadChars() in imageProcessing.py. This step just cleans up the text before it is turned into speech.
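A minimal version of that idea, paired with a Tesseract call via pytesseract (an assumption; the repo may invoke Tesseract differently), looks like this:

```python
import pytesseract

def remove_non_utf8(raw):
    # Accept bytes or str; drop anything that is not valid UTF-8
    if isinstance(raw, bytes):
        return raw.decode('utf-8', errors='ignore')
    return raw.encode('utf-8', errors='ignore').decode('utf-8', errors='ignore')

def ocr_text_block(cleaned_image):
    # Run Tesseract on the pre-processed text block, then scrub the output
    raw = pytesseract.image_to_string(cleaned_image)
    return remove_non_utf8(raw)
```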
To run this project on the DeepLens, you will first need to install a few packages from a terminal window on the device.
sudo apt-get update && sudo apt-get install tesseract-ocr && sudo apt-get install python-gi
Also, in order to get the audio to work on DeepLens, I had to perform the following steps:
- Log into the DeepLens, then double-click any .mp3 audio file.
- A prompt will open asking you to install some required packages that are needed to play back audio on the device.
- You will need to enter an administrator password to proceed.
Instructions for creating a DeepLens project can be found in the online documentation.
The model files for this project are located here: https://github.com/alexschultz/ReadToMe/tree/master/mxnet-model
You will need to tar up the files and put them in S3 when you create the project for the DeepLens. See the official AWS documentation for detailed instructions.
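If you prefer to script that step, a rough Python sketch of archiving the model directory and uploading it to S3 looks like the following; the bucket name is a placeholder.

```python
import tarfile
import boto3

MODEL_DIR = 'mxnet-model'              # the model directory from this repo
ARCHIVE = 'readtome-model.tar.gz'
BUCKET = 'your-deeplens-model-bucket'  # placeholder; use your own bucket

# Bundle the model files into a single tar.gz
with tarfile.open(ARCHIVE, 'w:gz') as tar:
    tar.add(MODEL_DIR, arcname='.')

# Upload the archive so the DeepLens project can reference it
boto3.client('s3').upload_file(ARCHIVE, BUCKET, ARCHIVE)
```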
The Lambda function code is contained inside the lambda directory in this repo.
In order to build the Lambda function, you will need to install the project dependencies and package up the lambda to be deployed to AWS. This needs to be done from a Linux machine, as the dependencies are OS-specific.
To install the pip packages, cd into the lambda directory and run the following command:
pip install -r requirements.txt
Once you have the pip packages installed, you need to bundle up the files into a distributable so it can be uploaded to the Lambda console.
I have included a helper Python script to package up the lambda in the format that AWS requires. See the documentation referenced above for more details on how to create a custom DeepLens project.
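The actual helper script is in the lambda directory; conceptually it just zips the code and its pip-installed dependencies into one archive, roughly along the lines of this sketch (file names are placeholders).

```python
import os
import zipfile

def package_lambda(src_dir='.', out_file='lambda_bundle.zip'):
    # Zip the lambda code plus its pip-installed dependencies
    with zipfile.ZipFile(out_file, 'w', zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                if os.path.abspath(path) == os.path.abspath(out_file):
                    continue  # don't include the archive in itself
                zf.write(path, os.path.relpath(path, src_dir))

if __name__ == '__main__':
    package_lambda()
```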
If you have any questions or find any issues with this project, please open an issue. Thanks!