Search for text in images. Uses cognitive services OCR

JasonWHowell/search-images

search-images: Find text inside images

search-images

How to solve a (not so theoretical) problem:

I'm documenting a UI, and a term in the UI has changed. How do I find all the images that use this term? I have 100s (or even 1000s) of images, and I don't want to have to open each one!

The search-images.py Python script searches for text inside images! It's set up to search for images in any media folder inside MicrosoftDocs/azure-docs. It creates a .csv file listing the files that match your search phrase.
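
As a rough illustration of the "any media folder" idea, the sketch below filters repository paths down to image files that sit somewhere under a folder named media. The function name is_media_image and the list of image extensions are assumptions made for this example; the actual script may select files differently.

```python
from pathlib import PurePosixPath

# Assumption: which extensions count as images. Adjust to suit your repo.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def is_media_image(repo_path):
    """Return True if repo_path is an image file inside a folder named 'media'."""
    path = PurePosixPath(repo_path)
    # parts[:-1] holds the folders only, so a file called media.png does not count.
    in_media_folder = "media" in path.parts[:-1]
    return in_media_folder and path.suffix.lower() in IMAGE_SUFFIXES


if __name__ == "__main__":
    for candidate in ("articles/storage/media/overview/portal.png",
                      "articles/storage/overview.md"):
        print(candidate, is_media_image(candidate))
```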

Prerequisites

  • Run Quickstart: Optical character recognition (OCR).

  • Install the PyGithub package

    pip install PyGithub  
  • Create a GitHub personal access token. In step 8, set the scope to repo.

  • Create the following environment variables to be accessed when you run the Python script:

    • GH_ACCESS_TOKEN - the token you created from GitHub
    • COMPUTER_VISION_ENDPOINT - the endpoint you created from the OCR Quickstart
    • COMPUTER_VISION_SUBSCRIPTION_KEY - the key you created from the OCR Quickstart
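
One way to confirm the three variables are set before a long run is a small check like the one below. This is a minimal sketch, not code taken from the script: the helper name load_settings is hypothetical, and only the variable names come from the list above.

```python
import os

REQUIRED_VARS = (
    "GH_ACCESS_TOKEN",
    "COMPUTER_VISION_ENDPOINT",
    "COMPUTER_VISION_SUBSCRIPTION_KEY",
)


def load_settings(environ=os.environ):
    """Return the required settings as a dict, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    settings = load_settings()
    # Print only the names, never the secret values.
    print("Loaded:", ", ".join(settings))
```

Failing fast here avoids finding out about a missing key partway through a batch of images.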

Run the script

  1. Edit the file search-images.py and fill out the PUT YOUR DETAILS HERE section with your values. This is where you say what to search for, where to search, and where to write results.

  2. Run search-images.py.

    • Go grab a coffee, go to lunch, or find something else to work on.
    • For 600 images, the script took approximately 15 minutes to complete. Your mileage may vary.

Results

Results are printed to the screen, so that you can watch the progress. They are also added to a .csv file.

  • If the file contains the search term, it is added to the results with a status of "found".
  • If the file can't be processed, it is added to the results with a status of "unknown". You'll need to manually inspect these files.
  • If the file doesn't contain the search term, you won't see it in the results.
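
The three rules above can be sketched as follows. The function names, the column headers file and status, and the convention that None stands for an image that could not be processed are all assumptions for illustration; the real script's .csv layout may differ.

```python
import csv
import io


def classify(ocr_text, search_term):
    """Return 'found', 'unknown', or None (meaning: leave the file out of the results).

    ocr_text is the text extracted from one image, or None if OCR failed.
    """
    if ocr_text is None:
        return "unknown"
    if search_term in ocr_text:  # case-sensitive, like the current script
        return "found"
    return None


def write_results(ocr_by_path, search_term, out):
    """Write one CSV row per image that is 'found' or 'unknown'."""
    writer = csv.writer(out)
    writer.writerow(["file", "status"])  # assumed header names
    for path, text in ocr_by_path.items():
        status = classify(text, search_term)
        if status is not None:
            print(path, status)  # echo progress to the screen
            writer.writerow([path, status])


if __name__ == "__main__":
    sample = {"a.png": "Select Resource group", "b.png": "Overview", "c.png": None}
    buffer = io.StringIO()
    write_results(sample, "Resource group", buffer)
    print(buffer.getvalue())
```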

Future directions

  • When I need it, I'll modify this to search for multiple terms. Right now I'm just looking for one.
  • Search is case-sensitive. Modify it if necessary to make it case-insensitive.
  • You could adapt the script to process local files, following the example from OCR: Read File using the Read API, extract text - local. Note the sleep time in that loop is 10 times larger than for online files.
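
For the first two ideas, a possible starting point is a helper that checks several terms at once and ignores case. This is a sketch under assumptions: matching_terms is a hypothetical name and is not part of the current script, which does a single case-sensitive match.

```python
def matching_terms(ocr_text, terms):
    """Return the terms that appear in ocr_text, ignoring case."""
    # casefold() is a more aggressive lower() intended for caseless comparison.
    haystack = ocr_text.casefold()
    return [term for term in terms if term.casefold() in haystack]


if __name__ == "__main__":
    text = "Click Resource Group to continue"
    print(matching_terms(text, ["resource group", "Subscription"]))
```

An image would then count as "found" whenever the returned list is non-empty, and the list itself could go in an extra .csv column.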
