Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specific HTML can cause from_read to hang indefinitely #54

Closed
scandox opened this issue Feb 24, 2022 · 11 comments
Closed

Specific HTML can cause from_read to hang indefinitely #54

scandox opened this issue Feb 24, 2022 · 11 comments
Labels

Comments

@scandox
Copy link

scandox commented Feb 24, 2022

While testing against a large batch of HTML samples I found that one appeared to cause an infinite loop when calling from_read.

Sample attached. I have tested this both with library call and with command line.
infinite.zip

I think it would be nice to be able to have some kind of limiting factor because HTML in the wild can be very weird and malformed.

@jugglerchris
Copy link
Owner

Thanks for the report and test case!

@jugglerchris
Copy link
Owner

I've found the problem: is to do with how it tries to allocate columns widths when laying out deeply nested and wide tables. The column width estimate is ridiculously large and the strategy for shrinking to fit into the allowed width doesn't work well.

I think the full fix will tie in with changes needed for #48 and #53, but this is important enough to fix ASAP.

@scandox
Copy link
Author

scandox commented Feb 26, 2022

Here is an alternate sample which shows same behaviour. I assume the root cause is the same, but thought it might be of help.
sample2.zip

@jugglerchris
Copy link
Owner

Thanks! I'll check the fix with this test case as well.

@jugglerchris
Copy link
Owner

I believe I've just about fixed this branch bugfix_issue_54. Small tables look a bit tidier, but more importantly I've improved handling of larger tables and very nested tables-in-tables. One intended side effect is that for very large (wide) tables, cells will be on top of each other rather than next to each other.
I've got a couple more minor fixes to do before merging and releasing, but if you get a chance it'd be good to know if it works for you.

@scandox
Copy link
Author

scandox commented Feb 27, 2022

I got a panic on an index out of bounds with the attached. I have a pretty huge test set but this triggered pretty early.

sample3.zip

I think this is the relevant part of the stacktrace:

 6: html2text::render::text_renderer::BorderHoriz::join_below
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:569:20
   7: html2text::render::text_renderer::BorderHoriz::merge_from_below
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:583:21
   8: <html2text::render::text_renderer::TextRenderer<D> as html2text::render::Renderer>::append_columns_with_borders
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:1067:29
   9: html2text::render_table_row::{{closure}}
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1425:17
  10: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/boxed.rs:1868:9
  11: html2text::tree_map_reduce
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:740:27
  12: html2text::render_tree_to_string
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1093:5
  13: html2text::RenderTree::render
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1489:23
  14: html2text::from_read_with_decorator
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1554:5
  15: html2text::from_read
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1564:5

@jugglerchris
Copy link
Owner

Thanks for another test case! I've fixed the out of bounds panic.
There are still a couple of things that don't look right I need to fix.

@jugglerchris
Copy link
Owner

I've opened #55. The only thing I think would be nice to do here is to do something to show the difference between a vertical column of table cells and a horizontal row of cells which has been rendered vertically because of space issues (e.g. a border of ///// instead of ----- or similar).

@scandox
Copy link
Author

scandox commented Feb 28, 2022

I admire your commitment to the tables :). I know the README says "rendering reasonable HTML in a terminal" among your goals, but I'd like to also humbly submit a suggestion for my use case (indexing textual content): it would be very cool to have a mode that ignores the whole issue of layout and produces mere text.

@jugglerchris
Copy link
Owner

You're not the first to ask for that! See #53. I do plan to support that as well as Markdown-compliant output (#48) as both seem like useful things to do.

@jugglerchris
Copy link
Owner

I believe this is now fixed, and included in the 0.3.0 release. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants