Hi, this is an amazing project.
I want to integrate this for RAG: use gmft to parse the tables while also parsing the normal content in the PDF. Can you share an example where this is possible?
Thanks
table # type: gmft.CroppedTable
print(' '.join(text for _,_,_,_,text in table.text_positions(outside=True)))
But that won't work if there are multiple tables. That also loses pymupdf's newline placement.
For this use case, I've actually been using native pymupdf:
# Setup
doc = pymupdf.open('notebooks/samples/stats.pdf') # type: pymupdf.Document
table # type: gmft.CroppedTable
# Code
to_dict = table.to_dict()
page_no = to_dict['page_no'] # table.page.page_number
page = doc[page_no]
rect = to_dict['bbox'] # table.bbox
annot = page.add_redact_annot(rect) # https://github.com/pymupdf/PyMuPDF/issues/698
# You can apply multiple redactions, so multiple tables per page should work
page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)
for page in doc:
    print(page.get_text())
Pymupdf is a fantastic library. The only reason why I don't have pymupdf in the main lib is the license issue.
So those two are the current options for getting text outside of tables.
When turning the table into content for RAG, I recommend converting the dataframe to markdown.
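A minimal sketch of that conversion, using only the standard library. If you already have a pandas DataFrame, `DataFrame.to_markdown()` does the same job (it needs the optional `tabulate` package). The helper name `rows_to_markdown` and the header/rows input shape are assumptions made for illustration, not part of gmft.

```python
def rows_to_markdown(header, rows):
    """Render a header and a list of rows as a pipe-delimited markdown table."""

    def fmt(cell):
        # Escape pipes and flatten newlines so a cell cannot break the table grid.
        return str(cell).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(fmt(h) for h in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(fmt(c) for c in row) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown(["group", "n"], [["control", 12], ["treated", 15]]))
```

With a DataFrame `df`, the inputs could be taken as `list(df.columns)` and `df.values.tolist()`.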
In general, I find gpt models' performance on tables to be as follows:
markdown ~ latex > html > csv-plus* >> tsv ~ csv >> native pdf formatting (space-separated)
*csv-plus: slight modification of csv, where an extra space is after each comma.
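To make the csv-plus format concrete, here is a rough sketch that writes ordinary CSV cells but joins them with a comma followed by a space. The function name `to_csv_plus` is hypothetical, and the choice to keep standard CSV quoting for cells that contain commas is my assumption; the description above only specifies the extra space.

```python
import csv
import io


def to_csv_plus(rows):
    """Serialize rows as CSV with one extra space after each delimiter comma."""
    lines = []
    for row in rows:
        cells = []
        for value in row:
            buf = io.StringIO()
            # Writing a one-cell row lets the csv module handle quoting and escaping.
            csv.writer(buf, lineterminator="").writerow([value])
            cells.append(buf.getvalue())
        lines.append(", ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_csv_plus([["name", "score"], ["Smith, J.", 3]]))
```

The csv module only accepts a single-character delimiter, which is why each cell is serialized on its own and the two-character separator is added when joining.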
After getting the text inside and outside the tables, I simply concatenate. Placing the document in the correct location in the document flow is probably possible with some effort, but unfortunately I don't have an example.
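One possible way to approximate the document flow, sketched under strong assumptions: a single-column layout, and that you already have each text block and each table as a `(page_no, y0, content)` tuple, where `y0` is the top coordinate on the page. With pymupdf, `page.get_text("blocks")` returns tuples that include `y0` and the block text, and a gmft table's bbox gives its top edge. The function name `merge_in_reading_order` is hypothetical and this is not the project's method.

```python
def merge_in_reading_order(text_blocks, tables):
    """Interleave text blocks and table strings by page, then by vertical position.

    Both arguments are iterables of (page_no, y0, content) tuples.
    Assumes a single-column page where a smaller y0 means higher on the page.
    """
    items = sorted(
        list(text_blocks) + list(tables),
        key=lambda item: (item[0], item[1]),
    )
    return "\n\n".join(content for _, _, content in items)


if __name__ == "__main__":
    blocks = [(0, 50.0, "Intro"), (0, 400.0, "Discussion"), (1, 30.0, "Appendix")]
    tables = [(0, 200.0, "| a |\n| --- |\n| 1 |")]
    print(merge_in_reading_order(blocks, tables))
```

Multi-column pages would need an extra sort key (for example the column's `x0`), which this sketch does not attempt.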
Hello. I finally wrote some prototypical code that does this. The pymupdf path is definitely higher quality but requires you to abide by the stricter AGPL license. 795f229