Commit 332fee3

Updated Automation Script for PDF to Text Conversion using Python
1 parent dffdef0 commit 332fee3

6 files changed: 70 additions and 51 deletions

AUTOMATION/PDF To Text/README.md

Lines changed: 39 additions & 11 deletions
@@ -1,21 +1,49 @@
-# Extracting text from PDF using Python
+# Extracting Text from PDF using Python
 
-Create a new folder and create a pdfToText.py file in it. Copy and paste the code in pdfToText.py in this repository to that file.
+This project is aimed at extracting text from PDF files using Python.
 
-Open the Terminal:
+## Getting Started
 
-```py
-pip install pdfminer.six
+These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
 
+### Prerequisites
+
+Before running the script, you must install the appropriate dependencies. To install these dependencies, run the following command in your terminal.
+
+```bash
+pip install -r requirements.txt
 ```
 
-In the same folder, add the pdf from which you want to extract text (Here the pdf used is test.pdf). Provide this pdf as a command line argument.
+### Using the Tool
 
-Run the script using:
+Follow these steps to use the tool:
 
-```py
-python3 pdfToText.py test.pdf
+1. Run the 'pdfToText.py' script:
 
-```
+```bash
+python pdfToText.py
+```
+
+2. When prompted, provide the full path along with the file name of the PDF from which you want to extract text. For example:
+
+```bash
+D:\FolderName\FileName.pdf
+```
+
+3. The data from the PDF will be extracted and stored in a .txt file in the same folder. For example:
+
+```bash
+D:\FolderName\FileName.txt
+```
+
+### Error Handling
+
+If any error is encountered during the process, it will be printed on the screen. For resolution, check the error message and debug accordingly.
+
+Feel free to report any bugs or request features using the issue tracker.
+
+## Example Run and Output
+
+Below is a screenshot demonstrating how to run the commands in the terminal:
 
-The extracted text will be available in converted_pdf.txt
+![Sample Usage of the Script](./SampleUsage.png)
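
Review note: steps 2 and 3 of the new README imply that the output file is the input path with its extension swapped to `.txt`, which matches the `with_suffix('.txt')` call in `pdfToText.py`. The sketch below illustrates that mapping only. `output_path_for` is a hypothetical helper name, and `PureWindowsPath` is an assumption made so the README's `D:\` example behaves the same on any operating system; the script itself uses `Path`.

```python
from pathlib import PureWindowsPath


def output_path_for(pdf_path: str) -> PureWindowsPath:
    """Return the .txt path that sits next to the given PDF path.

    Hypothetical helper for illustration; it only swaps the final suffix.
    """
    return PureWindowsPath(pdf_path).with_suffix('.txt')


if __name__ == '__main__':
    # Only the last suffix is replaced, so dots earlier in the name survive.
    print(output_path_for(r'D:\FolderName\FileName.pdf'))
    print(output_path_for(r'D:\FolderName\report.v2.pdf'))
```

Because only the final suffix changes, an existing `.txt` file with the same base name in that folder would be overwritten by the script's `open(..., 'w')` call.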

AUTOMATION/PDF To Text/SampleUsage.png

31.4 KB

AUTOMATION/PDF To Text/converted_pdf.txt

Lines changed: 0 additions & 28 deletions
This file was deleted.

AUTOMATION/PDF To Text/pdfToText.py

Lines changed: 29 additions & 12 deletions
@@ -1,17 +1,34 @@
-import argparse
-import pdfminer.high_level
+from pathlib import Path
+from PyPDF2 import PdfReader
 
-# Extract text with Pdfminer.six Module
-def With_PdfMiner(pdf):
-    with open(pdf,'rb') as file_handle_1:
-        doc = pdfminer.high_level.extract_text(file_handle_1)
 
-    with open('converted_pdf.txt','w') as file_handle_2 :
-        file_handle_2.write(doc)
+def convert_pdf(filename):
+    my_file = Path(filename)
+
+    # Check if provided PDF file exists
+    if not my_file.is_file():
+        print('Error! File Not Found!')
+        return None
+    print('PDF Found! Attempting Conversion...')
+
+    # Exception Handling during Data Extraction from PDF File
+    try:
+        # Define .txt file which will contain the extracted data
+        out_filename = my_file.with_suffix('.txt')
+        # Extracting Data from PDF file page-by-page and storing in TXT file
+        pdf_reader = PdfReader(filename)
+        with open(out_filename, 'w', encoding='utf-8') as extracted_data:
+            for page in pdf_reader.pages:
+                text = page.extract_text()
+                extracted_data.write(text)
+        print('PDF to TXT Conversion Successful!')
+
+    # If any Error is encountered, Print the Error on Screen
+    except Exception as e:
+        print(f'Error Converting PDF to Text or Saving Converted Text into .txt file: {e}')
+        return None
 
 
 if __name__ == '__main__':
-    parser = argparse.ArgumentParser()
-    parser.add_argument("file", help = "PDF file from which we extract text")
-    args = parser.parse_args()
-    With_PdfMiner(args.file)
+    file = input('Enter Full Path and FileName: ')
+    convert_pdf(file)
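
Review note: the new loop writes each page's text back-to-back, so the last line of one page runs directly into the first line of the next. The sketch below shows one possible variant of the write step that adds a separator after every page and treats a missing result as empty text. It is not the commit's code: `write_pages` and `FakePage` are hypothetical names, and the guard against `extract_text()` returning `None` is a defensive assumption rather than documented library behavior. The function accepts any objects with an `extract_text()` method, so it can be tried without a PDF.

```python
import tempfile
from pathlib import Path


def write_pages(pages, out_path, separator='\n'):
    """Write each page's text to out_path, one separator after every page.

    `pages` is any iterable of objects exposing extract_text(), such as
    the `pdf_reader.pages` sequence used in the script (assumption).
    Returns the number of pages written.
    """
    count = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for page in pages:
            # Treat a missing or empty result as an empty string.
            text = page.extract_text() or ''
            out.write(text)
            out.write(separator)
            count += 1
    return count


class FakePage:
    """Stand-in for a PDF page, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / 'demo.txt'
        pages = [FakePage('first page'), FakePage(None), FakePage('last page')]
        print(write_pages(pages, target), 'pages written')
        print(repr(target.read_text(encoding='utf-8')))
```

In `convert_pdf`, the body of the `with open(...)` block could call such a helper with `pdf_reader.pages`; whether a newline is the right page separator depends on how the output is consumed.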

AUTOMATION/PDF To Text/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+PyPDF2
+pathlib
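
Review note: `pathlib` has been part of the Python standard library since Python 3.4, so listing it here makes `pip` install the old standalone backport from PyPI instead of relying on the built-in module. The sketch below is one way to flag such entries. `is_stdlib_module` and `stdlib_entries` are hypothetical helper names, the input is assumed to be bare package names as in this file (no version specifiers), and the fallback for interpreters older than 3.10 is an approximation based on install paths.

```python
import sys
import sysconfig
from importlib.util import find_spec


def is_stdlib_module(name: str) -> bool:
    """Best-effort check that `name` is a standard-library module."""
    names = getattr(sys, 'stdlib_module_names', None)  # Python 3.10+
    if names is not None:
        return name in names
    # Approximate fallback for older interpreters: compare install paths.
    spec = find_spec(name)
    if spec is None or not spec.origin:
        return False
    if spec.origin in ('built-in', 'frozen'):
        return True
    stdlib_dir = sysconfig.get_paths()['stdlib']
    in_stdlib_dir = spec.origin.startswith(stdlib_dir)
    return in_stdlib_dir and 'site-packages' not in spec.origin


def stdlib_entries(requirements):
    """Return the requirement names that are already standard-library modules."""
    return [name for name in requirements if is_stdlib_module(name)]


if __name__ == '__main__':
    # The two entries added by this commit.
    print(stdlib_entries(['PyPDF2', 'pathlib']))
```

If the check flags `pathlib`, dropping that line from `requirements.txt` would leave `PyPDF2` as the only third-party dependency.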

AUTOMATION/PDF To Text/test.pdf

-7.76 KB
Binary file not shown.
