search-images: Find text inside images

How to solve a (not so theoretical) problem:

I'm documenting a UI, and a term in the UI has changed. How do I find all the images that use this term? I have 100s (or even 1000s) of images, and I don't want to have to open each one!

The search-images.py Python script searches for text inside images! It's set up to search for images in any media folder inside MicrosoftDocs/azure-docs. It creates a .csv file listing the files that match your search phrase.
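As a rough illustration of "any media folder", here is a minimal sketch of how repository paths could be filtered down to images that live under a folder named media. The helper name, the extension list, and the sample paths are all assumptions made for this example; they are not taken from search-images.py.

```python
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to whatever the repository actually uses.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def is_media_image(path):
    """True if the repo-relative path is an image file under some 'media' folder."""
    p = PurePosixPath(path)
    in_media = "media" in p.parts[:-1]  # some parent directory is named 'media'
    return in_media and p.suffix.lower() in IMAGE_EXTENSIONS


# Hypothetical paths, only for demonstration.
paths = [
    "articles/storage/media/overview/portal.png",
    "articles/storage/overview.md",
    "articles/media/notes.txt",
    "includes/media/diagram.JPG",
]
matches = [p for p in paths if is_media_image(p)]
```

With the sample list, only the .png and .JPG files under a media folder are kept.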

Prerequisites

Run Quickstart: Optical character recognition (OCR).
Install the PyGithub package:
```
pip install PyGithub  
```
Create a GitHub personal access token. In step 8, set the scope to repo.
Create the following environment variables to be accessed when you run the Python script:
- GH_ACCESS_TOKEN - the personal access token you created on GitHub
- COMPUTER_VISION_ENDPOINT - the endpoint you created in the OCR Quickstart
- COMPUTER_VISION_SUBSCRIPTION_KEY - the key you created in the OCR Quickstart
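Before kicking off a long run, it can help to confirm that all three variables are actually set. The following is a minimal sketch of such a check; the variable names come from the list above, but the load_settings helper is hypothetical and not part of search-images.py.

```python
import os

# Names taken from the list above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "GH_ACCESS_TOKEN",
    "COMPUTER_VISION_ENDPOINT",
    "COMPUTER_VISION_SUBSCRIPTION_KEY",
)


def load_settings(env=os.environ):
    """Return the required variables as a dict, or raise if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling load_settings() at the top of a script fails fast with a list of the missing names, rather than failing partway through the run.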

Run the script

Edit the file search-images.py and fill out the PUT YOUR DETAILS HERE section with your values. This is where you say what to search for, where to search, and where to write results.
Run search-images.py.
- Go grab a coffee, go to lunch, or find something else to work on.
- For 600 images, the script took approximately 15 minutes to complete. Your mileage may vary.

Results

Results are printed to the screen, so that you can watch the progress. They are also added to a .csv file.

If the file contains the search term, it is added to the results with a status of "found".
If the file can't be processed, it is added to the results with a status of "unknown". You'll need to manually inspect these files.
If the file doesn't contain the search term, you won't see it in the results.
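The three rules above can be sketched roughly as follows. This is an illustration only: the function names, the CSV column names, and the convention that unreadable images come back as None are assumptions, not the actual code in search-images.py.

```python
import csv
import io


def classify(text, term):
    """Return 'found', 'unknown', or None (None means: leave the file out of the results)."""
    if text is None:  # assumed convention: OCR could not process the image
        return "unknown"
    return "found" if term in text else None


def write_results(rows, out):
    """Write (file, status) pairs as CSV to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(["file", "status"])  # assumed column names
    for path, status in rows:
        writer.writerow([path, status])


# Hypothetical OCR output keyed by file name.
ocr_output = {
    "a.png": "Click Create resource",
    "b.png": None,
    "c.png": "Nothing relevant here",
}
rows = []
for path, text in ocr_output.items():
    status = classify(text, "Create resource")
    if status is not None:
        rows.append((path, status))

buf = io.StringIO()
write_results(rows, buf)
```

In this sample, a.png is recorded as found, b.png as unknown, and c.png is left out entirely.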

Future directions

When I need it, I'll modify this to search for multiple terms. Right now I'm just looking for one.
Search is case sensitive. Modify the script if you need it to be case insensitive.
You could adapt the script to process local files, following the example from OCR: Read File using the Read API, extract text - local. Note the sleep time in that loop is 10 times larger than for online files.
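For the first two ideas above (multiple terms and case-insensitive matching), one possible shape is sketched below. It assumes the OCR text for an image is already available as a single string. The find_terms helper is hypothetical and does not exist in search-images.py.

```python
def find_terms(text, terms, case_sensitive=True):
    """Return the subset of terms that occur in text, in the order given."""
    haystack = text if case_sensitive else text.casefold()
    found = []
    for term in terms:
        needle = term if case_sensitive else term.casefold()
        if needle in haystack:
            found.append(term)
    return found
```

An image would then count as a match whenever the returned list is non-empty, and the list itself could be written to the results so you know which term was hit.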

