OCR Improves Machine Translation for Low-Resource Languages

This folder contains the scripts to run the data preparation and evaluation for the following paper.

@inproceedings{ignat2022ocr,
  author = "Oana Ignat and Jean Maillard and Vishrav Chaudhary and Francisco Guzmán",
  title = "OCR Improves Machine Translation for Low-Resource Languages",
  booktitle = "Findings of ACL 2022, Long Papers",
  year = 2022
}

Contents:

Setup:

  1. Install Tesseract v4.
  2. Install the Python requirements:
     pip install -r requirements.txt
  3. To use the Google Vision API, set up authentication with Google Cloud.
  4. You may need to change the CHROME_PATH value in data_collection/augment_data.py to the location of Google Chrome on your computer.
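
The setup steps above can be sanity-checked with a short script. The following is a minimal sketch and not part of this repository: the helper names `tesseract_version` and `check_setup` are hypothetical, and it assumes that Google Cloud authentication is provided through the `GOOGLE_APPLICATION_CREDENTIALS` environment variable (other authentication methods, such as credentials configured through the gcloud CLI, would not be detected). The Chrome path is passed in by the caller, since its correct value depends on your machine.

```python
import os
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print version info to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def check_setup(chrome_path=None):
    """Collect a best-effort report on the setup steps (hypothetical helper)."""
    report = {
        "tesseract": tesseract_version(),
        # Assumption: credentials are supplied via this environment variable.
        "google_credentials_set": bool(
            os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
        ),
    }
    if chrome_path is not None:
        # Only checks that a file exists at the path you intend to use
        # for CHROME_PATH; it does not launch Chrome.
        report["chrome_found"] = os.path.isfile(chrome_path)
    return report


if __name__ == "__main__":
    for key, value in check_setup().items():
        print(f"{key}: {value}")
```

If `tesseract` is reported as `None`, Tesseract is not on your PATH; otherwise, check that the printed version string starts with 4. To check a candidate Chrome location, call `check_setup(chrome_path=...)` with the path you plan to set as CHROME_PATH.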