parsed
.gitignore
.parcelrc
README.md
cloud_vision_batch
cloud_vision_cmd
confusables.js
get_images_from_pdf
ignorePages.js
index.html
index.js
manual.js
nodeRunner.js
package-lock.json
package.json
request.json
viewer.html
viewer.js

README.md

Image parsing process

Image extraction

Images are extracted from the PDF using pdfimages:

pdfimages -png input.pdf out
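
If the extraction step needs to be scripted rather than typed by hand, the command above can be wrapped in Node.js. This is only a sketch: it assumes the pdfimages binary (from poppler-utils) is on the PATH, and the helper names pdfimagesArgs and extractImages are hypothetical, not part of this repository.

```javascript
// Sketch: run `pdfimages -png <pdf> <prefix>` from Node.js.
// Assumes the pdfimages binary is installed and on the PATH.
const { execFile } = require("node:child_process");

// Build the argument list for the command shown above.
function pdfimagesArgs(pdfPath, outputPrefix) {
  return ["-png", pdfPath, outputPrefix];
}

// Hypothetical helper: resolves once pdfimages has exited successfully,
// rejects with the underlying error otherwise.
function extractImages(pdfPath, outputPrefix) {
  return new Promise((resolve, reject) => {
    execFile("pdfimages", pdfimagesArgs(pdfPath, outputPrefix), (error) => {
      if (error) reject(error);
      else resolve();
    });
  });
}

// Usage (not executed here):
//   extractImages("input.pdf", "out").then(() => console.log("done"));
```

execFile is used instead of exec so the file names are passed as separate arguments and never interpreted by a shell.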

Google Cloud Vision OCR

Upload the images to a Google Cloud Storage bucket, then use the Cloud Vision API (see the cloud_vision_batch file) to extract the text with OCR.
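
For orientation, a batch OCR request body for the Cloud Vision REST method images:asyncBatchAnnotate could be assembled roughly as follows. This is a sketch, not the contents of request.json or cloud_vision_batch in this repository: the bucket name, object paths, the choice of DOCUMENT_TEXT_DETECTION, and the helper name buildBatchRequest are all assumptions.

```javascript
// Sketch: build a request body for images:asyncBatchAnnotate.
// Bucket and object names below are placeholders.
function buildBatchRequest(imageUris, outputUri) {
  return {
    requests: imageUris.map((uri) => ({
      image: { source: { gcsImageUri: uri } },
      // Assumption: DOCUMENT_TEXT_DETECTION, the OCR feature aimed at dense text.
      features: [{ type: "DOCUMENT_TEXT_DETECTION" }],
    })),
    outputConfig: {
      // Results are written as JSON files under this Cloud Storage prefix.
      gcsDestination: { uri: outputUri },
      batchSize: 1, // assumption: at most one response per output file
    },
  };
}

const body = buildBatchRequest(
  ["gs://my-bucket/out-000.png", "gs://my-bucket/out-001.png"],
  "gs://my-bucket/ocr/"
);
console.log(JSON.stringify(body, null, 2));
```

The resulting JSON would be sent in an authenticated POST to the Vision API; authentication and polling for the finished operation are left out of this sketch.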

Automated parsing

Automated parsing is done by the parseData function in index.js. The function is called by viewer.js in the browser (start it with npm run start) and by nodeRunner.js under Node.js.
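
The signature of parseData is not documented here, so the following does not show how it is called. It only sketches one plausible preparation step: pulling the recognized text out of a Cloud Vision output file, where each response that contains text carries it in fullTextAnnotation.text. The helper name extractPageTexts and the sample data are hypothetical.

```javascript
// Sketch: collect the OCR text of every response in a Cloud Vision output file.
// Responses without detected text yield an empty string.
function extractPageTexts(ocrOutput) {
  return (ocrOutput.responses || []).map((response) =>
    response.fullTextAnnotation ? response.fullTextAnnotation.text : ""
  );
}

// Made-up sample data in the shape of a batch output file.
const sample = {
  responses: [
    { fullTextAnnotation: { text: "Line one\nLine two\n" } },
    {}, // an image in which no text was found
  ],
};

const pages = extractPageTexts(sample);
console.log(pages.length); // 2
```

Because the helper is a pure function over plain JSON, the same code can run unchanged in the browser and under Node.js.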

Manual parsing

After the automated parsing, use manual.js (built with React) to manually correct the JSON.
