Specific HTML can cause `from_read` to hang indefinitely #54

scandox · 2022-02-24T20:10:58Z

While testing against a large batch of HTML samples I found that one appeared to cause an infinite loop when calling from_read.

Sample attached. I have tested this both with library call and with command line.
infinite.zip

I think it would be nice to be able to have some kind of limiting factor because HTML in the wild can be very weird and malformed.

The text was updated successfully, but these errors were encountered:

jugglerchris · 2022-02-24T21:28:03Z

Thanks for the report and test case!

jugglerchris · 2022-02-25T07:47:53Z

I've found the problem: is to do with how it tries to allocate columns widths when laying out deeply nested and wide tables. The column width estimate is ridiculously large and the strategy for shrinking to fit into the allowed width doesn't work well.

I think the full fix will tie in with changes needed for #48 and #53, but this is important enough to fix ASAP.

scandox · 2022-02-26T22:47:54Z

Here is an alternate sample which shows same behaviour. I assume the root cause is the same, but thought it might be of help.
sample2.zip

jugglerchris · 2022-02-27T07:41:15Z

Thanks! I'll check the fix with this test case as well.

jugglerchris · 2022-02-27T16:23:45Z

I believe I've just about fixed this branch bugfix_issue_54. Small tables look a bit tidier, but more importantly I've improved handling of larger tables and very nested tables-in-tables. One intended side effect is that for very large (wide) tables, cells will be on top of each other rather than next to each other.
I've got a couple more minor fixes to do before merging and releasing, but if you get a chance it'd be good to know if it works for you.

scandox · 2022-02-27T19:32:50Z

I got a panic on an index out of bounds with the attached. I have a pretty huge test set but this triggered pretty early.

sample3.zip

I think this is the relevant part of the stacktrace:

 6: html2text::render::text_renderer::BorderHoriz::join_below
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:569:20
   7: html2text::render::text_renderer::BorderHoriz::merge_from_below
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:583:21
   8: <html2text::render::text_renderer::TextRenderer<D> as html2text::render::Renderer>::append_columns_with_borders
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/render/text_renderer.rs:1067:29
   9: html2text::render_table_row::{{closure}}
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1425:17
  10: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/alloc/src/boxed.rs:1868:9
  11: html2text::tree_map_reduce
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:740:27
  12: html2text::render_tree_to_string
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1093:5
  13: html2text::RenderTree::render
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1489:23
  14: html2text::from_read_with_decorator
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1554:5
  15: html2text::from_read
             at /home/xxxx/.cargo/git/checkouts/rust-html2text-b3c655ea43e116a6/4718adf/src/lib.rs:1564:5

jugglerchris · 2022-02-27T21:47:10Z

Thanks for another test case! I've fixed the out of bounds panic.
There are still a couple of things that don't look right I need to fix.

jugglerchris · 2022-02-27T23:20:02Z

I've opened #55. The only thing I think would be nice to do here is to do something to show the difference between a vertical column of table cells and a horizontal row of cells which has been rendered vertically because of space issues (e.g. a border of ///// instead of ----- or similar).

scandox · 2022-02-28T11:34:30Z

I admire your commitment to the tables :). I know the README says "rendering reasonable HTML in a terminal" among your goals, but I'd like to also humbly submit a suggestion for my use case (indexing textual content): it would be very cool to have a mode that ignores the whole issue of layout and produces mere text.

jugglerchris · 2022-02-28T12:54:40Z

You're not the first to ask for that! See #53. I do plan to support that as well as Markdown-compliant output (#48) as both seem like useful things to do.

jugglerchris · 2022-03-06T09:49:25Z

I believe this is now fixed, and included in the 0.3.0 release. Thanks again!

jugglerchris added the bug label Feb 24, 2022

jugglerchris closed this as completed Mar 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specific HTML can cause `from_read` to hang indefinitely #54

Specific HTML can cause `from_read` to hang indefinitely #54

scandox commented Feb 24, 2022 •

edited

Loading

jugglerchris commented Feb 24, 2022

jugglerchris commented Feb 25, 2022

scandox commented Feb 26, 2022

jugglerchris commented Feb 27, 2022

jugglerchris commented Feb 27, 2022

scandox commented Feb 27, 2022 •

edited

Loading

jugglerchris commented Feb 27, 2022

jugglerchris commented Feb 27, 2022

scandox commented Feb 28, 2022

jugglerchris commented Feb 28, 2022

jugglerchris commented Mar 6, 2022

Specific HTML can cause from_read to hang indefinitely #54

Specific HTML can cause from_read to hang indefinitely #54

Comments

scandox commented Feb 24, 2022 • edited Loading

jugglerchris commented Feb 24, 2022

jugglerchris commented Feb 25, 2022

scandox commented Feb 26, 2022

jugglerchris commented Feb 27, 2022

jugglerchris commented Feb 27, 2022

scandox commented Feb 27, 2022 • edited Loading

jugglerchris commented Feb 27, 2022

jugglerchris commented Feb 27, 2022

scandox commented Feb 28, 2022

jugglerchris commented Feb 28, 2022

jugglerchris commented Mar 6, 2022

Specific HTML can cause `from_read` to hang indefinitely #54

Specific HTML can cause `from_read` to hang indefinitely #54

scandox commented Feb 24, 2022 •

edited

Loading

scandox commented Feb 27, 2022 •

edited

Loading