Math Expression Detection

Detect mathematical expressions in worksheets and draw bounding boxes.

Examples

How is it done?

Scraped images from Bing for the keyword "math worksheets" using google-images-download.
Annotated ~50 worksheets, assigning 0 to non-math expressions and 1 to math expressions. This is referred to as the MathWorksheetsOCR dataset.
Run general-purpose text detection with CRAFT (Character-Region Awareness For Text detection).
A BERT-based binary classifier, trained on the annotated data, removes non-mathematical expressions.
Apply non-maximal suppression to combine multiple intersecting bounding boxes.
Plot the bounding boxes over the images.
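
The box-combining step can be sketched as follows. This is a minimal illustration, not the code in non_maximal_supression.py: it assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max), and the function names are hypothetical. Note that classical non-maximal suppression discards lower-scoring overlapping boxes, whereas the step described here combines intersecting boxes, so the sketch merges them into their union.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def intersects(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a: Box, b: Box) -> Box:
    """Smallest box that encloses both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_intersecting(boxes: List[Box]) -> List[Box]:
    """Merge pairs of intersecting boxes until no two boxes overlap."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if intersects(merged[i], merged[j]):
                    merged[i] = union(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the merged box may now touch others
    return merged
```

For example, `merge_intersecting([(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)])` merges the first two boxes and leaves the third untouched.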

Data

data/train: The MathWorksheetsOCR dataset: 50 worksheets hand-annotated for binary classification. It contains 2332 expressions in total, of which 1859 are math expressions and 473 are non-math expressions. The dataset is skewed (about 80% math expressions) because math worksheets consist mostly of math expressions.
image-dataset/bing-scrap-dataset: 100 worksheets scraped from Bing.
image-dataset/worksheets: 10 examples used as our development set.
image-dataset/handwritten: Handwritten sheets provided.
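
The skew reported for data/train can be checked with a few lines of arithmetic. The sketch below also computes inverse-frequency class weights, which are one common way to compensate for this kind of imbalance. That weighting is an assumption added for illustration; the README does not say that train_classifier.py uses it.

```python
# Class counts reported for MathWorksheetsOCR (label 0 = non-math, 1 = math).
counts = {0: 473, 1: 1859}

total = sum(counts.values())

# Fraction of the dataset that belongs to each class.
share = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: total / (num_classes * class_count).
# Assumption for illustration only; not confirmed as part of the training code.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total expressions: {total}")
print(f"math share: {share[1]:.1%}")
print(f"class weights: {weights}")
```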

Code

boundingbox.py: Takes an image folder, computes bounding boxes, and plots them.
train_classifier.py: Takes the annotated data examples and trains a binary classifier on top of BERT.
classifier.py: Loads up trained BERT classifier. Runs inference.
data.py: Custom PyTorch Dataset class for Math Expressions.
non_maximal_supression.py: Performs non-maximal suppression. Credit

How was MathWorksheetsOCR created?

Scraped 50 worksheets from Bing.
Used easyOCR to recognize the text in each worksheet.
Hand-annotated the recognized text as either 0 or 1.
The final dataset size is,

How was the BERT classifier trained?

Used transformers to fine-tune BertForSequenceClassification on the MathWorksheetsOCR dataset.
The fine-tuned model is available at this Google Drive link.

How does the final detection work?

Every image is passed through easyOCR to get both bounding boxes and the text for each box.
All non-math text is removed using the trained BERT classifier.
Non-maximal suppression is applied to the remaining bounding boxes to combine intersecting windows.
The final boxes are plotted and saved in the bb folder.
Voila!
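
The filtering step in this pipeline can be sketched as below. It is an illustration rather than the code in boundingbox.py: it assumes detections arrive as (corner points, text, confidence) tuples, which is the shape easyOCR's readtext returns by default, and it takes the classifier as a plain callable. The names keep_math_boxes, quad_to_box, and toy_is_math are hypothetical, and toy_is_math is only a stand-in for the trained BERT classifier.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
# Assumed detection shape: (four corner points, recognized text, confidence)
Detection = Tuple[Sequence[Point], str, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def quad_to_box(quad: Sequence[Point]) -> Box:
    """Convert four corner points into an axis-aligned bounding box."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (min(xs), min(ys), max(xs), max(ys))


def keep_math_boxes(detections: List[Detection],
                    is_math: Callable[[str], bool]) -> List[Box]:
    """Keep only the boxes whose recognized text is classified as math."""
    return [quad_to_box(quad) for quad, text, _conf in detections if is_math(text)]


def toy_is_math(text: str) -> bool:
    """Hypothetical stand-in for the BERT classifier: digits plus an operator."""
    return any(ch.isdigit() for ch in text) and any(op in text for op in "+-=*/")


detections = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "3x + 2 = 11", 0.93),
    ([[10, 50], [200, 50], [200, 70], [10, 70]], "Solving Quadratic Equations", 0.98),
    ([[2, 10], [8, 10], [8, 30], [2, 30]], "2b.", 0.80),
]
math_boxes = keep_math_boxes(detections, toy_is_math)
```

The surviving boxes would then be passed to the box-combining step and plotted.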

What did I observe?

The results for 3 different datasets can be viewed at image-dataset/bing-scrap-dataset/bb, image-dataset/handwritten/bb, image-dataset/worksheets/bb.
Detection works well even on difficult examples where an expression is split across two lines, because non-maximal suppression combines the intersecting boxes.
All non-math text is removed: instructions like "Solving Quadratic Equations", question numbers like "2b." and "3)", and any other irrelevant text at the end of the worksheet.
Precision without the BERT classifier was low because a lot of non-math noise was included in the predictions. After adding the BERT classifier, precision increased.
I observed all of this through qualitative analysis. Quantitative analysis, such as computing precision/recall using IOU, requires ground-truth bounding boxes for the data.
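
If ground-truth boxes were available, the quantitative evaluation mentioned above could look roughly like this. It is a sketch under stated assumptions: boxes are axis-aligned (x_min, y_min, x_max, y_max) tuples, a prediction counts as correct when its IOU with an unmatched ground-truth box is at least 0.5, and matching is greedy. The threshold and the function names are illustrative choices, not something this repository defines.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union_area = area_a + area_b - inter
    return inter / union_area if union_area > 0 else 0.0


def precision_recall(pred: List[Box], truth: List[Box],
                     threshold: float = 0.5) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched = set()
    true_positives = 0
    for p in pred:
        best_j, best_score = None, threshold
        for j, t in enumerate(truth):
            if j in matched:
                continue
            score = iou(p, t)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```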

What didn't work?

I tried using ScanSSD pre-trained on datasetname. However, the results were not accurate. I believe this is because ScanSSD is trained on LaTeX math expressions, whereas we needed it to work on math worksheets. Hence the decision to create annotated examples.
I tried using perplexity from GPT-2 to remove non-math expressions. I assumed that math expressions would have higher perplexity than non-math expressions. However, no significant difference was observed between them.
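
For reference, perplexity is the exponential of the negative mean log-probability that a language model assigns to the tokens of a sequence. The sketch below computes it from a list of per-token natural-log probabilities. How those log-probabilities were obtained from GPT-2 in the experiment is not shown here, and the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p)) over per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)
```

A sequence whose tokens each have probability 0.25 has perplexity 4, which matches the intuition of the model choosing uniformly among four options at every step.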

Final Thoughts

A better approach to this problem would be to construct an annotated dataset for these math worksheets from the ground up. The annotations should be bounding boxes.
Perhaps we could use Amazon Mechanical Turk to annotate data from different distributions, for example handwritten or camera-captured sheets.
Use IOU (intersection over union) to compute the precision and recall of the bounding boxes. Since our dataset does not have bounding-box annotations at the moment, we used human evaluation for the results.
Unsupervised clustering of BERT embeddings of math and non-math text for removing noise.
Deep Learning works (sorta)!
