Skip to content

devdatalab/parse-pdf-table

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 

Repository files navigation

The big idea

We read in tables from PDFs that have an underlying text layer for these tables.

Quick start

Create a conda environment from the included config file:

conda env create -f parse-pdf-table.yml
conda activate parse-pdf-table

Navigate to the github directory, which on polaris looks usually like:

~/ddl/parse-pdf-table

Install the pdfparser library

pip install --editable .

Minimum example

import pdfparser as parser
import pdb


out_df = parser.parse_pdf_table(pdf_str="~/iec/pc01/district_handbooks/DH_33_2001_KKU.pdf",
                                outfile = "/scratch/kjha/out.csv", page_no=278, dist_thresh=None, num_columns=7)

To parse a pdf, you need to specify a distance threshold dist_thresh, between 10 and 100 seems to work in many cases or specify the number of columns expected in the table num_columns but not both. One of these should always be none. For the use case this was designed on; district handbooks for the pc handbooks, specifying the number of columns works best, since we extract appendix tables that have the same number of columns across pages.

Additional options to the function let you specify a bounding box within which to search for the table, which could be useful if this is used in a pipeline that uses layoutparser to detect tables in a pdf.

In case you want to use this with a table without a text layer, use ocrmypdf locally on the docs to generate a text layer and then use ddlpdfparser on them.

A slightly more involved example: Reading in a table from PC01 district pdf for Nagaland's Mokokchung district

  1 # load packages
  2 from ddlpy.utils.constants import *
  3 from pathlib import Path
  4 from text_eb_processing import longest_run_of_integers, find_appendix_pages
  5 import ddlpdfparser as parser
  6 import pdb
  7
  8
  9 ###################
 10 # Minimum Example #
 11 ###################
 12 pc01_hb = IEC / "pc01/district_handbooks"
 13 pdf_path = pc01_hb / "DH_13_2001_MOK.pdf"
 14
 15 list_matches = find_appendix_pages(pdf_path)
 16 list_appendix = longest_run_of_integers(list_matches)
 17 pdb.set_trace()
 18 # KJ: Change page number
 19 out_df = parser.parse_pdf_table(pdf_str=pdf_path, page_no=178, outfile = f"/scratch/kjha/out_178.csv",
 20                                 dist_thresh=None, num_columns=7, key_col=4)
 21

About

Tool for reading in tables in pdfs with a text layer.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%