embed_tables misaligning the tables in the rich_page #45

Open
ibrahim-jaafar opened this issue Feb 7, 2025 · 3 comments

ibrahim-jaafar commented Feb 7, 2025

I am facing a problem when using embed_tables to create a rich page. As you can see in the attached PDF, the table sits at the top of the page. However, when I run my code, the table is placed at the bottom of the rich page.

test2.pdf

from gmft.auto import AutoTableDetector, AutoTableFormatter
from gmft_pymupdf import PyMuPDFDocument
from gmft._rich_text.rich_page import embed_tables

detector = AutoTableDetector()
formatter = AutoTableFormatter()


def ingest_pdf(pdf_path):  # returns (list of formatted tables, open document)
    doc = PyMuPDFDocument(pdf_path)  # parser to open and process the document
    tables = []
    for page in doc:
        cropped_tables = detector.extract(page)  # Detect tables
    
        formatted_tables = [
            formatter.format(table) for table in cropped_tables
        ]  #  Format tables
        tables.extend(formatted_tables)  # Add formatted tables to the list
    return tables, doc


tables, doc = ingest_pdf("insert path to pdf")

rich_pages = embed_tables(doc=doc, tables=tables)
doc.close()

for page in rich_pages:
    print(page.get_text()) 
    print("\n" + "-"*50 + "\n")  # Print a separator between pages

output:

Page 54
Journal of Pyrotechnics, Issue 10, Winter 1999
Results
Powder performance was determined using a
test apparatus designed to simulate the ap-
proximate conditions in the firing of aerial
shells.[14] Results are shown in Tables 2 and 3.
The interpretation is rather straightforward, and
only a few comments are needed. The best per-
formance was obtained first from the Willow
charcoal obtained from Guy Lichtenwalter and
the Black Willow based powder from Jack
Fielder. The author’s NLC based powder was a
significant performer as well. Note that Goex
brand Black Powder gives results that are lower
than most of the handmade samples. The Fielder
Buckthorn-based powder and the author’s Ai-
lanthus-based powder also performed respecta-
bly.
Future Research
The production of the best charcoal from
Carolina Buckthorn and Alder Buckthorn is still
being studied. It is possible that these Buckthorn
varieties require more careful drying before
pyrolysis than other types of wood. The high
performance of Ailanthus also merits more re-
search to elucidate the relationship between the
physical and chemical properties of the wood
with the charcoal produced from it.
There have been numerous pyrogolf compe-
titions over the past few years, and it is quite
likely that the best charcoal from any one wood
species has yet to be made. One PGI pyrogolf
participant very nearly won the first event with
a Maple based lift powder. Another participant
made a very good powder from Red Cedar.
The author plans to obtain scanning electron
micrographs of several of the charcoals dis-
cussed here. A heuristic method of determining
the degree of graphite structure in the charcoal
will then be applied. In a related study, the
volatile components of a particular charcoal
could potentially be removed with solvents.
Then the charcoal would be compared to itself,
with and without these volatile components.
Oglesby[16] indicates that powder made from so-
called “stripped” charcoal is just as fast as or
faster than the original.
The effects of the various pressing methods
also need to be studied. It is clear that, in gen-
eral, the lower the density of the grains, the
faster the powder.
Another aspect of a given charcoal is the
percentage used to produce the powder. All of
the experimental lift powders discussed in this
work and by O’Neill[17] use the 15/3/2 Waltham
Abbey proportions, but a given charcoal may
produce better results in a 6/1/1, 25/5/4 or even
5/1/1 mixture. This has not been studied.
Finally, a significant aspect of commercial
Black Powder should be examined. Namely,
Goex brand powder burns significantly cleaner
than the handmade lift powders discussed here.
Table 3. Results of Powder Tests (3.5 gram Samples).
|    | Charcoal Type                            | Ave. Velocity (ft/s)   | Ave. Peak Pressure (psi)   |
|---:|:-----------------------------------------|:-----------------------|:---------------------------|
|  0 | Skylighter Air Float(Hubing)             | 70                     | —                          |
|  1 | Carolina Buckthorn (Judd) Aspen (Hubing) | 226 237                | 117 —                      |
|  2 | Thinleaf Alder                           | 270                    | 240                        |
|  3 | Ailanthus                                | 328                    | 396                        |
|  4 | Alder Buckthorn (Fielder)                | 445                    | 762                        |
|  5 | Black Willow (Fielder)                   | 473                    | 819                        |

Note that the best powders in Table 3 outperform those in Table 2. If the Aspen performance is used as a base-
line, the values for speed in Table 3 should be multiplied by 1.6 to obtain the expected speed if 5 grams had
been used.

--------------------------------------------------
@conjuncts
Owner

embed_tables uses the reading order provided by the PDF. Sometimes the PDF reports a table after the paragraphs, even when the table appears above them on the page, which explains why the table ends up at the bottom here.

@ibrahim-jaafar
Author

Thanks for the answer. Is it possible to modify the embed_tables logic to use bounding box coordinates instead of the order provided by the PDF?

@conjuncts
Owner

It's a bit tricky. You can always take the functions embed_tables_into_page and embed_tables from the gmft._rich_text.rich_page module and modify them for your own purposes, as the functions serve mostly as templates (hence the underscore). For instance, PyMuPDF suggests sorting by y-coordinate, followed by x-coordinate, as a way of emulating the reading order.

You could modify embed_tables_into_page by writing something like:

words = list(page._get_positions_and_text_and_breaks())
words.sort(key=lambda x: (x[1], x[0]))
for x0, y0, x1, y1, word, blockno, lineno, wordno in words:
    # ...
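
To make the effect of that sort concrete, here is a minimal, self-contained sketch that does not depend on gmft or PyMuPDF. The `Word` tuple, the `sort_reading_order` helper, and the `line_tol` parameter are hypothetical names introduced only for illustration. It assumes y grows downward, as in the snippet above. Unlike a plain `(y0, x0)` key, it groups words whose top edges are within a small tolerance into the same line, since words on one visual line often differ slightly in y0.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    # Hypothetical stand-in for one (x0, y0, x1, y1, text) entry.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def sort_reading_order(words: List[Word], line_tol: float = 3.0) -> List[Word]:
    """Order words top-to-bottom, then left-to-right within each line.

    Words whose y0 is within `line_tol` of the first word of the current
    line are treated as belonging to that line.
    """
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: w.y0):
        if lines and abs(w.y0 - lines[-1][0].y0) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    # Flatten, sorting each line by its left edge.
    return [w for line in lines for w in sorted(line, key=lambda w: w.x0)]


if __name__ == "__main__":
    sample = [
        Word(50, 700, 90, 712, "bottom"),
        Word(120, 99.5, 150, 111, "B"),
        Word(50, 100, 80, 112, "A"),
    ]
    print([w.text for w in sort_reading_order(sample)])  # ['A', 'B', 'bottom']
```

Note that a purely coordinate-based order like this interleaves the columns of a multi-column page, which is one reason the approach is only an approximation of reading order.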

If the reading order of the PDF is crucial, then I would suggest dedicated tools such as marker, Amazon Textract, or MinerU, which use deep learning for that purpose.
