pdftabextract - A set of tools for data mining (OCR-processed) PDFs

July 2016, Markus Konrad [email protected] / Berlin Social Science Center

Introduction

This repository contains a set of tools written in Python 3 for extracting tabular data from (OCR-processed) PDF files. Before these files can be processed, they need to be converted to XML files in pdf2xml format. This is very simple -- see the section below for instructions.

After that, you can optionally view the extracted text boxes with the pdf2xml-viewer tool. The pdf2xml format can be loaded and parsed with functions in the common submodule. If the pages are skewed, they need to be straightened before further processing, which can be done with the fixrotation submodule. Afterwards you can extract tabular data from these files and output the data in CSV or JSON format using the tabextract submodule. A rough sketch of this pipeline is shown after the feature list below.

Features

  • load and parse files in pdf2xml format (common submodule)
  • straighten skewed pages (fixrotation submodule)
  • extract tabular data from pdf2xml files and output the data in CSV or JSON format (tabextract submodule)
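The sketch below only illustrates how these steps fit together; the function names in it are placeholders and not the package's documented API. The submodule names (common, fixrotation, tabextract) are those described above -- see the scripts in the examples directory for the actual calls:

# Illustrative pipeline sketch only: the function names below are placeholders,
# NOT the package's real API. Check the examples/ directory for working code.
from pdftabextract import common, fixrotation, tabextract

pages = common.parse_pdf2xml('output.xml')      # 1. load and parse the pdf2xml file (placeholder name)
pages = fixrotation.straighten(pages)           # 2. straighten skewed pages (placeholder name)
tables = tabextract.extract_tables(pages)       # 3. detect the table structure (placeholder name)
tabextract.save_as_csv(tables, 'output.csv')    # 4. write the result as CSV or JSON (placeholder name)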

Requirements

The requirements are listed in requirements.txt. You basically need a scientific Python software stack installed (for example via Anaconda or pip) with the following libraries:

  • numpy
  • scipy

The scripts were only tested with Python 3. They might also work with Python 2.x with minor modifications.

Converting PDF files to XML files with pdf2xml format

You need to convert your PDFs using poppler-utils, a package that is part of most Linux distributions and is also available for OSX via Homebrew or MacPorts. From this package we need the pdftohtml command, with which you can create an XML file in pdf2xml format from the terminal like this:

pdftohtml -c -i -hidden -xml input.pdf output.xml

The arguments input.pdf and output.xml are your input PDF file and the created XML file in pdf2xml format, respectively. It is important that you specify the -hidden parameter when you're dealing with OCR-processed ("sandwich") PDFs. You can furthermore add the parameters -f n and -l n to restrict the conversion to a certain range of pages.
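A quick way to check the conversion result is to inspect the generated XML directly: in the pdf2xml format, each page element contains text elements whose top, left, width and height attributes give the position of a text box. The following is a small sketch using only Python's standard library, assuming the output file is named output.xml as above:

# Quick inspection of a pdf2xml file using only the standard library.
import xml.etree.ElementTree as ET

tree = ET.parse('output.xml')   # the file created by pdftohtml above
root = tree.getroot()           # <pdf2xml> root element

for page in root.iter('page'):
    texts = page.findall('text')
    print('page %s: %d text boxes (%s x %s)'
          % (page.get('number'), len(texts), page.get('width'), page.get('height')))
    for t in texts[:3]:
        # each text box carries its position as attributes; itertext() also
        # collects text nested inside <b> or <i> tags
        print('  top=%s left=%s: %r'
              % (t.get('top'), t.get('left'), ''.join(t.itertext()).strip()))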

Usage and examples

For usage and background information, please read my series of blog posts about data mining PDFs.

You should have a look at the examples to see how to use the provided functions and configuration settings. Examples are provided in the examples directory. Remember to set the PYTHONPATH according to where you put the pdftabextract package. You can run an example straight from the repository root with PYTHONPATH=. python examples/process_ocr_output.py (note: your Python 3 executable might be named python3).

Alternatively, you can use an IDE like Spyder.

See the following images of the example input/output:

  • Original OCR-processed ("sandwich") PDF
  • Generated (and skewed) pdf2xml file viewed with pdf2xml-viewer
  • Straightened file
  • Extracted data (CSV file imported to LibreOffice)

License

Apache License 2.0. See LICENSE file.
