trying to extract text, not all strings are present - looks like those with non-latin characters are gone #102

Niedzwiedzw · 2021-09-23T14:49:53Z

I'm unable to provide an example pdf cause it contains sensitive data though :(

s3bk · 2021-09-23T18:54:55Z

@Niedzwiedzw which approach are you using?
I will try to give you instructions on how to get the relevant information without leaking the sensitive data tomorrow.

Niedzwiedzw · 2021-09-23T19:42:37Z

I've switched to master branch to be able to use named enum-style Ops, but now it doesn't load the document at all

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?: 
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/file.rs", line: 94, column: 19, source: 
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/object/types.rs", line: 22, column: 42, 
 source: FromPrimitive { 
   typ: "Option < Content >", 
   field: "contents", 
   source: TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/content.rs",
  line: 237, column: 21, context: [("op.as_str()", "Ok(\"BI\")")], 
  source: MissingEntry { typ: "InlineImage", field: "ColorSpace" } } } } }', 
  invoices/src/parser.rs:155:47

s3bk · 2021-09-23T19:44:39Z

oh wow. an inline image. Will look into that as well.

Niedzwiedzw · 2021-09-23T19:45:54Z

I'm not creating the documents, and I can imagine standards compliance for PDF is a MESS. For some context, I'm trying to salvage what I can from some government-generated documents :D

Niedzwiedzw · 2021-09-23T19:54:46Z

@s3bk https://github.com/sbeckeriv/lopdf/blob/master/src/nom_parser.rs would this be useful to you at all?

s3bk · 2021-09-24T11:51:14Z

I don't think we are going to switch to nom. It is great, but PDF is a mess and we already have a handwritten parser.

s3bk · 2021-09-25T08:15:44Z

The PDF Reference lists ColorSpace as a non-optional field of inline images.
And I have no intention of allowing various deviations from the specification, as that is a bottomless pit.

s3bk · 2021-09-25T08:34:41Z

@Niedzwiedzw you are in luck. The color_space field is an Option, so I went ahead and made it optional in inline images.
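
For readers following along, here is a minimal sketch of what "required" versus "optional" means for a dictionary entry. It is not the pdf crate's code: the `InlineImage` struct, the `ParseError` enum, and the `required` and `parse_inline_image` helpers are all hypothetical, and the dictionary is simplified to a map of strings. A missing required key produces an error shaped like the `MissingEntry` in the panic above, while a missing optional key simply becomes `None`.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for an inline image dictionary.
#[derive(Debug, PartialEq)]
struct InlineImage {
    width: u32,
    height: u32,
    /// Optional: absent in the dictionary maps to `None`.
    color_space: Option<String>,
}

/// Hypothetical error type, loosely modelled on the panic message in this thread.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingEntry { typ: &'static str, field: &'static str },
    BadValue { field: &'static str },
}

/// Look up a required numeric entry; a missing key is an error.
fn required(dict: &HashMap<&str, &str>, field: &'static str) -> Result<u32, ParseError> {
    let raw = dict
        .get(field)
        .ok_or(ParseError::MissingEntry { typ: "InlineImage", field })?;
    raw.parse().map_err(|_| ParseError::BadValue { field })
}

/// Width and Height are treated as required, ColorSpace as optional.
fn parse_inline_image(dict: &HashMap<&str, &str>) -> Result<InlineImage, ParseError> {
    Ok(InlineImage {
        width: required(dict, "Width")?,
        height: required(dict, "Height")?,
        color_space: dict.get("ColorSpace").map(|s| s.to_string()),
    })
}

fn main() {
    let mut dict = HashMap::new();
    dict.insert("Width", "2");
    dict.insert("Height", "3");
    // No ColorSpace entry: parsing still succeeds, with color_space == None.
    println!("{:?}", parse_inline_image(&dict));

    dict.remove("Height");
    // A missing required entry is reported as MissingEntry.
    println!("{:?}", parse_inline_image(&dict));
}
```

Any other entry the parser treats as required would fail the same way when a document omits it.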

Niedzwiedzw · 2021-09-25T09:13:03Z

so cool thank you so much @s3bk

Niedzwiedzw · 2021-09-25T14:12:11Z

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/file.rs", 
line: 94, column: 19, source: 
Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/object/types.rs", 
line: 22, column: 42, source: FromPrimitive { typ: "Option < Content >", field: "contents", source: 
TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/content.rs", line: 236, 
column: 21, context: [("op.as_str()", "Ok(\"BI\")")], source: MissingEntry { typ: "InlineImage", field: "Decode" } } } } }', 
invoices/src/parser.rs:155:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

hmm

mike-kfed mentioned this issue Nov 29, 2021

use new utf16be decode functions from pdf::fonts pdf-rs/pdf_tools#6

Merged
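
The linked change mentions UTF-16BE decoding, which is one common place where non-Latin text gets lost. As a rough illustration only, the sketch below decodes big-endian UTF-16 bytes using nothing but the Rust standard library. The `decode_utf16be` function is a hypothetical helper written for this example, not the function from `pdf::fonts`, and whether this is the cause of the missing strings in this issue is an assumption.

```rust
/// Hypothetical helper: decode a byte slice as big-endian UTF-16.
/// Returns None for an odd byte count or invalid surrogate pairs.
fn decode_utf16be(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Combine each pair of bytes into one 16-bit code unit, high byte first.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // "Łódź" as UTF-16BE: U+0141, U+00F3, U+0064, U+017A.
    let bytes = [0x01, 0x41, 0x00, 0xF3, 0x00, 0x64, 0x01, 0x7A];
    assert_eq!(decode_utf16be(&bytes).as_deref(), Some("Łódź"));

    // An odd number of bytes cannot be valid UTF-16.
    assert_eq!(decode_utf16be(&[0x01]), None);
}
```

Treating the same bytes as a single-byte encoding would yield garbage or dropped characters for anything outside the Latin range, which matches the symptom in the title.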
