
Commit ec8c1ba

Merge pull request avinashkranjan#262 from vybhav72954/iss_168
Added Text Extract with additional functionalities
2 parents eb8b816 + 0bbdacd commit ec8c1ba

6 files changed, +114 -0 lines changed

Text_Extract_Images/README.md

+56
@@ -0,0 +1,56 @@
# Text_Extract

[![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)

Text extraction from images, OCR with Tesseract, and basic image manipulation are all important yet very basic scripting tasks.

This script uses ```pytesseract``` to extract text from images. Because ```pytesseract``` only recognizes the text and
returns it for printing, this script adds the ability to write the text to a `txt` and/or `csv` file.

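
The `text_extract.py` file in this commit writes one `.txt` file per image. The following is a minimal sketch of how CSV output might be collected alongside it, using only the standard library. The function name `write_results_csv` and the column headers `image` and `text` are assumptions made for illustration; they are not part of the script.

```python
import csv
from pathlib import Path


def write_results_csv(results, csv_path):
    """Write (image name, extracted text) pairs to a CSV file.

    `results` is any iterable of (name, text) tuples. The function name
    and the column headers are hypothetical, chosen for this sketch.
    """
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image", "text"])
        for name, text in results:
            # Drop leading/trailing whitespace that OCR output often carries
            writer.writerow([name, text.strip()])
    return Path(csv_path)
```

In the script, the pairs could be gathered inside the image loop (for example, appending `(imageName, text)` to a list) and passed to this helper once the loop finishes.
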
## Setup instructions

- Set up a `python 3.x` virtual environment.
- Activate the environment.
- Install the dependencies using ```pip3 install -r requirements.txt```.
- You are all set, and the [script](text_extract.py) is ready to run.
- Follow the prompts carefully.

## Further Readings

Newcomers often struggle with installing Tesseract; this is a direct link to the
[installer](https://github.com/UB-Mannheim/tesseract/wiki).

Instructions for setting up OCR can be found [here](http://bit.ly/2MClAwD).

Adding Tesseract to the __PATH__ environment variable lets `pytesseract` locate the executable without a hard-coded path.
[This](http://bit.ly/35d3c3Q) and [this](http://bit.ly/3ba0zmZ) link explain how to set it up.

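
To check whether Tesseract is reachable through __PATH__ before running OCR, a small helper along these lines may be useful. This is a sketch, not part of the script: the name `find_tesseract` is hypothetical, and it assumes the executable is called `tesseract`, which is the usual command name.

```python
import shutil


def find_tesseract(user_supplied=""):
    """Return a usable Tesseract command, or None if none is found.

    Prefers an explicit path entered by the user; otherwise falls back
    to the bare command name (assumed to be "tesseract").
    """
    candidate = user_supplied.strip() or "tesseract"
    # shutil.which resolves a bare command name through PATH and also
    # accepts a path that points directly at an executable file
    return shutil.which(candidate)
```

If the helper returns a path, it can be assigned to `pt.pytesseract.tesseract_cmd` as the script already does; if it returns `None`, Tesseract is probably not installed or not on __PATH__.
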
## Usage

Make sure that Tesseract is installed in the expected directory, then run the script and follow its prompts and the comments in the code.

```
Sample -
Enter the Folder name containing Images: <Name of Folder>
Enter your desired output location: <Name of Folder>
Enter the Path to Tesseract: <Path to Tesseract>
```

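
The script names each output file after its image: the image's file name without the extension, plus `.txt`. The sketch below shows that mapping in isolation and additionally skips sub-folders and non-image files, which the script itself does not do. The helper name `plan_outputs` and the `IMAGE_SUFFIXES` set are assumptions for illustration.

```python
from pathlib import Path

# Assumed set of extensions; adjust to the formats you actually use
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def plan_outputs(image_dir, output_dir):
    """Pair each image in image_dir with the .txt path it would produce."""
    pairs = []
    for entry in sorted(Path(image_dir).iterdir()):
        # Skip sub-folders and files that do not look like images
        if entry.is_file() and entry.suffix.lower() in IMAGE_SUFFIXES:
            pairs.append((entry, Path(output_dir) / (entry.stem + ".txt")))
    return pairs
```

For a folder containing `Sample.PNG`, this yields a single pair whose output path ends in `Sample.txt`, so it can serve as a dry run before any OCR is performed.
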
## Output

Console output

![Output](img/Output.PNG)

Image containing text

![Sample image](img/Sample.PNG)

After extraction

![Extracted text file](img/TextFile.PNG)

## Author(s)

Made by [Vybhav Chaturvedi](https://www.linkedin.com/in/vybhav-chaturvedi-0ba82614a/)

Text_Extract_Images/img/Output.PNG

3.43 KB

Text_Extract_Images/img/Sample.PNG

14.3 KB

Text_Extract_Images/img/TextFile.PNG

13.3 KB

Text_Extract_Images/requirements.txt

+2
@@ -0,0 +1,2 @@
pytesseract==0.3.6
Pillow==8.0.1

Text_Extract_Images/text_extract.py

+56
@@ -0,0 +1,56 @@
from PIL import Image
import pytesseract as pt
import os
from pathlib import Path


# Base directory against which the folder names entered below are resolved.
# os.path.join adds the separator, so no hard-coded backslash is needed.
current_location = os.getcwd()


def extract():
    """
    Function for extracting text from images.
    Additionally, it saves the extracted text as a txt file.
    """

    # Enter the name of folder which contains img files
    image_location = input("Enter the Folder name containing Images: ")
    image_path = os.path.join(current_location, image_location)

    # Enter the name of folder which would contain respective txt files
    destination = input("Enter your desired output location: ")
    destination_path = os.path.join(current_location, destination)
    # Create the output folder if it does not exist yet
    os.makedirs(destination_path, exist_ok=True)

    # Path to Tesseract
    tesseract_path = input("Enter the Path to Tesseract: ").strip()
    print('\nNOTE: '
          'It is preferable to set up the PATH variable to Tesseract, see README. \n')

    # e.g. r'C:\Program Files\Tesseract-OCR\tesseract'
    # If nothing is entered, keep the default command, which is looked up via PATH
    if tesseract_path:
        pt.pytesseract.tesseract_cmd = tesseract_path

    # iterating over the images in the folder
    for imageName in os.listdir(image_path):

        # Join the path and image name to obtain absolute path
        inputPath = os.path.join(image_path, imageName)
        img = Image.open(inputPath)

        # OCR
        text = pt.image_to_string(img, lang="eng")

        # Removing extensions
        img_file = Path(inputPath).stem
        print(img_file)

        # The output text file
        text_file = img_file + ".txt"
        output_path = os.path.join(destination_path, text_file)

        # saving the text for every image in a separate .txt file
        # (UTF-8 so that non-ASCII characters in the OCR result can be written)
        with open(output_path, "w", encoding="utf-8") as file:
            file.write(text)


if __name__ == '__main__':
    extract()
