Is there a way to parse the whole pdf and the tables alone with gmft #12

Open
sahilarora3117 opened this issue Jul 29, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@sahilarora3117

Hi, this is an amazing project.
I want to integrate this into a RAG pipeline and use gmft to parse the tables while also parsing the normal content in the PDF. Can you share an example where this is possible?
Thanks

@conjuncts
Owner

Hi, great question. There are a few options:

table # type: gmft.CroppedTable
print(' '.join(text for _,_,_,_,text in table.text_positions(outside=True)))

But that won't work if there are multiple tables on the page, and it also loses pymupdf's newline placement.

For this use case, I've actually been using native pymupdf:

# Setup
import pymupdf

doc = pymupdf.open('notebooks/samples/stats.pdf') # type: pymupdf.Document
table # type: gmft.CroppedTable

# Code
table_dict = table.to_dict()
page_no = table_dict['page_no'] # table.page.page_number
page = doc[page_no]
rect = table_dict['bbox'] # table.bbox
# Mark the table region for removal; see https://github.com/pymupdf/PyMuPDF/issues/698
page.add_redact_annot(rect)
# You can add multiple redactions before applying, so multiple tables per page should work
page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)

# Print the remaining text, i.e. everything outside the redacted table
for page in doc:
    print(page.get_text())

Pymupdf is a fantastic library. The only reason why I don't have pymupdf in the main lib is the license issue.

So those two are the current options for getting text outside of tables.
When turning the table into content for RAG, I recommend converting the dataframe to markdown.
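As a rough illustration of the markdown step: if the table is already a pandas DataFrame, `DataFrame.to_markdown()` (which needs the optional `tabulate` package) does this directly. The sketch below is a dependency-free stand-in; the helper name `rows_to_markdown` and the sample data are made up for illustration and are not part of gmft.

```python
def rows_to_markdown(header, rows):
    """Render a header and data rows as a pipe-delimited markdown table."""
    def fmt(cell):
        # Escape pipes and flatten newlines so a cell cannot break the table layout.
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    lines = [
        "| " + " | ".join(fmt(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(fmt(c) for c in row) + " |")
    return "\n".join(lines)


# Hypothetical table contents, standing in for a parsed table's header and rows.
header = ["Group", "N", "Mean"]
rows = [["Control", 12, 3.4], ["Treated", 15, 4.1]]
markdown_table = rows_to_markdown(header, rows)
print(markdown_table)
```

The resulting string can be dropped into a RAG chunk as-is.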

In general, I find gpt models' performance on tables to be as follows:

markdown ~ latex > html > csv-plus* >> tsv ~ csv >> native pdf formatting (space-separated)
*csv-plus: slight modification of csv, where an extra space is after each comma.
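To make the csv-plus footnote concrete, here is one possible reading of it: ordinary CSV quoting per field, with fields joined by a comma and a space. The function name `to_csv_plus` and the sample rows are assumptions for illustration; the comment above does not specify how quoting should be handled.

```python
import csv
import io


def to_csv_plus(rows):
    """Serialize rows like CSV, but join fields with ', ' instead of ','."""
    out_lines = []
    for row in rows:
        fields = []
        for cell in row:
            # Let the csv module decide the quoting for one field at a time.
            buf = io.StringIO()
            csv.writer(buf, lineterminator="").writerow([cell])
            fields.append(buf.getvalue())
        out_lines.append(", ".join(fields))
    return "\n".join(out_lines)


# Hypothetical rows; the second one contains a comma inside a cell.
csv_plus_text = to_csv_plus([["Group", "N"], ["Control, pooled", 12]])
print(csv_plus_text)
```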

After getting the text inside and outside the tables, I simply concatenate the two. Placing each table at its correct location in the document flow is probably possible with some effort, but unfortunately I don't have an example.
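One possible way to approach the ordering problem, sketched under assumptions: if each text snippet and each table comes with a page number and a top y coordinate, sorting on that pair gives a rough reading order for single-column pages. The tuple layout and the name `merge_in_reading_order` are hypothetical. With pymupdf, the coordinates could come from `page.get_text("blocks")`, and for tables from the bbox already used for redaction. Multi-column layouts would need more work than this.

```python
def merge_in_reading_order(text_blocks, table_blocks):
    """Interleave text and table snippets by (page number, top y coordinate).

    Each item is a (page_no, y0, content) tuple, where y0 grows downward
    on the page. Assumes a single-column layout.
    """
    merged = sorted(
        list(text_blocks) + list(table_blocks),
        key=lambda item: (item[0], item[1]),
    )
    return "\n\n".join(content for _, _, content in merged)


# Hypothetical positions and contents, for illustration only.
text_blocks = [
    (0, 50.0, "Intro paragraph."),
    (0, 400.0, "Discussion after the table."),
    (1, 40.0, "Second page text."),
]
table_blocks = [
    (0, 200.0, "| a | b |\n| --- | --- |\n| 1 | 2 |"),
]
document_text = merge_in_reading_order(text_blocks, table_blocks)
print(document_text)
```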

@conjuncts conjuncts added the enhancement New feature or request label Aug 23, 2024
@conjuncts
Owner

Hello. I finally wrote some prototype code that does this. The pymupdf path is definitely higher quality, but it requires you to abide by the stricter AGPL license. See commit 795f229.
