This repository has been archived by the owner on Jan 6, 2025. It is now read-only.

Camelot speed up #427

Open
majd1239 opened this issue Jun 18, 2020 · 4 comments

Comments

@majd1239

majd1239 commented Jun 18, 2020

While using Camelot to extract tables from PDFs, I noticed it's really slow. I profiled the code, and it turns out that 60% of the bottleneck is np.isclose, here and here, as well as in multiple other places in core.py:

if np.isclose(te.x, x_coord, atol=0.5):

if np.isclose(self.y0, y0, atol=edge_tol):

The slowdown makes sense, since np.isclose has significant overhead when called on native Python floats instead of NumPy types.

I switched the method to math.isclose instead, and the processing time was cut by more than half!
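As a rough illustration (not from the issue itself), here is a minimal timing sketch of the two calls on native Python floats, the case Camelot hits in its hot loop. The specific values and iteration count are arbitrary:

```python
# Compare np.isclose vs math.isclose on plain Python floats.
import math
import timeit

import numpy as np

x, y = 100.3, 100.6

t_np = timeit.timeit(lambda: np.isclose(x, y, atol=0.5), number=10_000)
t_math = timeit.timeit(lambda: math.isclose(x, y, abs_tol=0.5), number=10_000)

print(f"np.isclose:   {t_np:.4f}s")
print(f"math.isclose: {t_math:.4f}s")
```

One caveat when swapping them: the two functions are not exactly equivalent. np.isclose treats two values as close when `|a - b| <= atol + rtol * |b|` (default rtol=1e-5), while math.isclose uses `|a - b| <= max(rel_tol * max(|a|, |b|), abs_tol)` (default rel_tol=1e-9), so results can differ near the tolerance boundary.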

I can submit a pull request with the changes if the devs agree this is a safe change to make.

Thanks

@kamil-kierski

I've checked your suggestion and read through the code, and it looks like it's only relevant for the "Stream" flavor. With Lattice, the quoted code snippets are not even used ;(

@FrancoisHuet

I've noticed a similar issue. np.isclose is slow, but one thing that doesn't help is that:

  • TextEdges.generate calls TextEdges.update n (number of textlines in the doc) times (code)
    • TextEdges.update calls TextEdges.find 3 (number of alignment dimensions) times (code)
      • TextEdges.find does a linear search across _textedges (~n of them so on average 0.5 n calls) using isclose to find elements (code),

=> so that's O(n^2) calls to isclose.
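The pattern described above can be sketched as follows. This is an illustrative simplification with hypothetical names, not Camelot's actual code: for each textline, a matching edge is found by scanning the whole list with an isclose test.

```python
# Hypothetical sketch of the O(n) scan that runs once per textline,
# giving O(n^2) isclose calls overall.
import math

def find_edge_linear(edges, coord, atol=0.5):
    """Linear search: on average ~n/2 isclose calls per lookup."""
    for i, edge_coord in enumerate(edges):
        if math.isclose(edge_coord, coord, abs_tol=atol):
            return i
    return None

edges = [float(i) for i in range(1000)]
print(find_edge_linear(edges, 500.2))  # -> 500
```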

While refactoring stream and network, I've changed this a bit by doing a binary search (method get_index_closest_point, log(n) complexity) before calling isclose. This makes the code faster overall and drastically reduces the number of isclose calls.
Code change is here. In my local tests this made the unit test suite faster by ~20% (from 85s to 69s), with the image generation becoming the next bottleneck.
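The binary-search idea can be sketched with the standard-library bisect module. The name get_index_closest_point comes from the refactor mentioned above, but this reimplementation is illustrative only and assumes the coordinate list is kept sorted:

```python
# Sketch: O(log n) nearest-neighbor lookup, then a single isclose call
# instead of a linear scan with ~n isclose calls.
import bisect
import math

def get_index_closest_point(sorted_coords, coord):
    """Return the index of the element of sorted_coords closest to coord."""
    i = bisect.bisect_left(sorted_coords, coord)
    if i == 0:
        return 0
    if i == len(sorted_coords):
        return len(sorted_coords) - 1
    # Pick the nearer of the two neighbors straddling coord.
    if sorted_coords[i] - coord < coord - sorted_coords[i - 1]:
        return i
    return i - 1

def find_edge_fast(sorted_coords, coord, atol=0.5):
    idx = get_index_closest_point(sorted_coords, coord)
    # One isclose call decides whether the closest point is close enough.
    if math.isclose(sorted_coords[idx], coord, abs_tol=atol):
        return idx
    return None

coords = [float(i) for i in range(1000)]
print(find_edge_fast(coords, 500.2))  # -> 500
```

The design point is that the sortedness invariant does the heavy lifting: the expensive tolerance check runs once per lookup instead of once per candidate.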

Bottom line: once the Pull Request for the new hybrid network is merged, this might no longer be needed.

@sttobia

sttobia commented Jun 8, 2021

Is there any update on this? I am currently also using Stream, which takes very long. @majd1239's proposal sounds very interesting, even if it only improves the speed of Stream.

@dss010101

update?
