Is there a way to parse the whole pdf and the tables alone with gmft #12

sahilarora3117 · 2024-07-29T14:00:27Z

Hi, this is an amazing project.
I want to integrate this into a RAG pipeline, using gmft to parse the tables while also parsing the normal content in the PDF. Can you share an example of how to do this?
Thanks

conjuncts · 2024-08-02T14:26:19Z

Hi, great question. There are a few options. The first is to ask gmft for the text outside of a table:

table # type: gmft.CroppedTable
print(' '.join(text for _,_,_,_,text in table.text_positions(outside=True)))

But that won't work if there are multiple tables. That also loses pymupdf's newline placement.
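If a page has several tables, one possible workaround is to filter the word list yourself. The sketch below assumes you already have the page's words as (x0, y0, x1, y1, text) tuples, the same shape unpacked in the snippet above, plus a list of table bounding boxes. `words_outside_tables` and the sample data are hypothetical and not part of gmft, and like the one-liner above this still discards newline placement.

```python
from typing import Iterable, List, Tuple

Word = Tuple[float, float, float, float, str]   # (x0, y0, x1, y1, text)
BBox = Tuple[float, float, float, float]        # (x0, y0, x1, y1)


def _center_inside(word: Word, bbox: BBox) -> bool:
    """True if the word's center point falls inside bbox."""
    x0, y0, x1, y1, _ = word
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return bbox[0] <= cx <= bbox[2] and bbox[1] <= cy <= bbox[3]


def words_outside_tables(words: Iterable[Word], table_bboxes: List[BBox]) -> List[str]:
    """Keep only the words whose center lies outside every table bbox."""
    return [w[4] for w in words
            if not any(_center_inside(w, b) for b in table_bboxes)]


# Hypothetical page: two tables, with prose before, between, and after them.
words = [
    (10, 10, 50, 20, "Intro"),
    (10, 110, 40, 120, "cell_a"),
    (10, 210, 60, 220, "Between"),
    (10, 310, 40, 320, "cell_b"),
    (10, 410, 50, 420, "Outro"),
]
tables = [(0, 100, 200, 150), (0, 300, 200, 350)]
outside_text = " ".join(words_outside_tables(words, tables))
print(outside_text)  # Intro Between Outro
```

Testing the word's center rather than its full box is a design choice: it keeps words that merely touch a table border, at the cost of occasionally keeping a word that overlaps the table edge.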

For this use case, I've actually been using native pymupdf:

# Setup
import pymupdf

doc = pymupdf.open('notebooks/samples/stats.pdf') # type: pymupdf.Document
table # type: gmft.CroppedTable (obtained earlier from gmft)

# Code
to_dict = table.to_dict()
page_no = to_dict['page_no'] # table.page.page_number
page = doc[page_no]
rect = to_dict['bbox'] # table.bbox
annot = page.add_redact_annot(rect) # https://github.com/pymupdf/PyMuPDF/issues/698
page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE) # You can apply multiple redactions, so multiple tables per page should work

for page in doc:
    print(page.get_text())

Pymupdf is a fantastic library. The only reason why I don't have pymupdf in the main lib is the license issue.

So those two are the current options for getting text outside of tables.
When turning the table into content for RAG, I recommend converting the dataframe into markdown.
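As an illustration of the markdown step, here is a minimal dependency-free sketch. `rows_to_markdown` is a hypothetical helper, not a gmft or pandas function, and the header and rows are made-up sample data.

```python
from typing import List, Sequence


def rows_to_markdown(header: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Render a header and rows as a pipe-delimited markdown table."""
    def fmt(cells: Sequence[object]) -> str:
        # Escape pipes so cell text cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), fmt(["---"] * len(header))]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


header = ["name", "score"]
rows = [["alpha", 1.5], ["beta", 2]]
markdown_table = rows_to_markdown(header, rows)
print(markdown_table)
```

With a pandas dataframe `df`, the same helper can be called as `rows_to_markdown(list(df.columns), df.values.tolist())`. pandas also provides `DataFrame.to_markdown()`, which needs the optional `tabulate` package installed.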

In general, I find that GPT models' performance on table formats ranks as follows:

markdown ~ latex > html > csv-plus* >> tsv ~ csv >> native pdf formatting (space-separated)
*csv-plus: slight modification of csv, where an extra space is after each comma.
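For concreteness, a small sketch of what csv-plus output could look like. `to_csv_plus` is a hypothetical helper, and the quoting rule for fields that contain commas is an assumption borrowed from ordinary CSV; the definition above only specifies the extra space.

```python
from typing import List, Sequence


def to_csv_plus(rows: List[Sequence[object]]) -> str:
    """Format rows as CSV with ', ' (comma plus space) between fields."""
    def quote(value: object) -> str:
        text = str(value)
        # Ordinary CSV quoting for fields with commas, quotes, or newlines.
        if any(ch in text for ch in ',"\n'):
            return '"' + text.replace('"', '""') + '"'
        return text

    return "\n".join(", ".join(quote(v) for v in row) for row in rows)


csv_plus = to_csv_plus([["name", "score"], ["alpha", 1.5], ["beta, jr", 2]])
print(csv_plus)
# name, score
# alpha, 1.5
# "beta, jr", 2
```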

After getting the text inside and outside the tables, I simply concatenate them. Placing each table at the correct location in the document flow is probably possible with some effort, but unfortunately I don't have an example.
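One way the ordering could be approached, sketched under strong assumptions: every text block and every table carries a page number and the y-coordinate of its top edge, and the layout is single-column. `merge_in_reading_order` and the sample blocks are hypothetical, and this is not the prototype referenced later in the thread.

```python
from typing import List, Tuple

# (page_no, y0, content): y0 is the top edge of the block on its page.
Block = Tuple[int, float, str]


def merge_in_reading_order(text_blocks: List[Block], table_blocks: List[Block]) -> str:
    """Interleave text and table blocks by page, then by vertical position."""
    ordered = sorted(text_blocks + table_blocks, key=lambda b: (b[0], b[1]))
    return "\n\n".join(content for _, _, content in ordered)


text_blocks = [
    (0, 50.0, "Intro paragraph."),
    (0, 400.0, "Discussion."),
    (1, 30.0, "Appendix."),
]
table_blocks = [(0, 200.0, "| a | b |\n| --- | --- |\n| 1 | 2 |")]
merged = merge_in_reading_order(text_blocks, table_blocks)
print(merged)
```

Sorting on (page, y0) places the table between the two page-0 paragraphs. A multi-column layout would need an ordering that also takes the x-coordinate into account.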

conjuncts · 2024-08-23T04:04:46Z

Hello. I finally wrote some prototype code that does this. The pymupdf path is definitely higher quality, but it requires you to abide by the stricter AGPL license. See 795f229.

conjuncts added the enhancement New feature or request label Aug 23, 2024

conjuncts mentioned this issue Sep 3, 2024

problems facing in gmft #18

Open
