trying to extract text, not all strings are present - looks like those with non-latin characters are gone #102

Open
Niedzwiedzw opened this issue Sep 23, 2021 · 10 comments

Comments

@Niedzwiedzw
Contributor

I'm unable to provide an example PDF because it contains sensitive data though :(

@s3bk
Contributor

s3bk commented Sep 23, 2021

@Niedzwiedzw which approach are you using?
I will try to give you instructions on how to get the relevant information without leaking the sensitive data tomorrow.

@Niedzwiedzw
Contributor Author

Niedzwiedzw commented Sep 23, 2021

I've switched to the master branch to be able to use named enum-style Ops, but now it doesn't load the document at all:

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/file.rs", line: 94, column: 19,
  source: Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/object/types.rs", line: 22, column: 42,
   source: FromPrimitive { typ: "Option < Content >", field: "contents",
    source: TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/content.rs", line: 237, column: 21, context: [("op.as_str()", "Ok(\"BI\")")],
     source: MissingEntry { typ: "InlineImage", field: "ColorSpace" } } } } }',
invoices/src/parser.rs:155:47

@s3bk
Copy link
Contributor

s3bk commented Sep 23, 2021

oh wow. an inline image. Will look into that as well.

@Niedzwiedzw
Contributor Author

I'm not creating the documents, and I can imagine the standard compliance for PDF is a MESS. For some context, I'm trying to salvage what I can from some government-generated documents :D

@Niedzwiedzw
Contributor Author

@s3bk https://github.com/sbeckeriv/lopdf/blob/master/src/nom_parser.rs would this be useful to you at all?

@s3bk
Contributor

s3bk commented Sep 24, 2021

I don't think we are going to switch to nom. It is great, but PDF is a mess and we already have a handwritten parser.

@s3bk
Contributor

s3bk commented Sep 25, 2021

The PDF Reference lists `ColorSpace` as a non-optional field of inline images.
And I have no intention of allowing various deviations from the specification, as that is a hole without bottom.

@s3bk
Contributor

s3bk commented Sep 25, 2021

@Niedzwiedzw you are in luck. The `color_space` field is an Option, so I went ahead and made it optional in inline images.
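
For readers following along, the difference between a required and an optional dictionary entry can be sketched roughly as below. This is not the crate's actual code: `Dict`, `ParseError`, `InlineImage`, `required`, `optional` and `parse_inline_image` are all hypothetical names, and values are kept as raw strings to keep the example short. A required lookup turns a missing key into a `MissingEntry`-style error, while an `Option` field simply becomes `None`.

```rust
use std::collections::HashMap;

/// Hypothetical error type, loosely modelled on the `MissingEntry`
/// variant that appears in the panic message (an assumption, not the crate's type).
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingEntry { typ: &'static str, field: &'static str },
}

/// Hypothetical stand-in for a parsed dictionary; values stay raw strings
/// to keep the sketch short.
type Dict = HashMap<&'static str, &'static str>;

/// Hypothetical, heavily simplified inline image.
#[derive(Debug, PartialEq)]
struct InlineImage {
    width: String,
    color_space: Option<String>,
}

/// Strict lookup: a missing key is reported as an error.
fn required(dict: &Dict, field: &'static str) -> Result<String, ParseError> {
    dict.get(field)
        .map(|value| value.to_string())
        .ok_or(ParseError::MissingEntry { typ: "InlineImage", field })
}

/// Lenient lookup: a missing key becomes `None`.
fn optional(dict: &Dict, field: &'static str) -> Option<String> {
    dict.get(field).map(|value| value.to_string())
}

/// `Width` is treated as required, `ColorSpace` as optional.
fn parse_inline_image(dict: &Dict) -> Result<InlineImage, ParseError> {
    Ok(InlineImage {
        width: required(dict, "Width")?,
        color_space: optional(dict, "ColorSpace"),
    })
}
```

With this shape, a dictionary without `ColorSpace` still parses, and only a missing `Width` produces an error.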

@Niedzwiedzw
Contributor Author

so cool thank you so much @s3bk

@Niedzwiedzw
Contributor Author

Niedzwiedzw commented Sep 25, 2021

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/file.rs", line: 94, column: 19,
  source: Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/object/types.rs", line: 22, column: 42,
   source: FromPrimitive { typ: "Option < Content >", field: "contents",
    source: TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/content.rs", line: 236, column: 21, context: [("op.as_str()", "Ok(\"BI\")")],
     source: MissingEntry { typ: "InlineImage", field: "Decode" } } } } }',
invoices/src/parser.rs:155:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

hmm
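
The `bad page?` prefix in both panics suggests the calling code unwraps each page result, so a single unparsable page aborts the whole run. When the goal is to salvage whatever text is available, one option is to keep the pages that parse and record the ones that do not. The sketch below is an assumption about how that could look on the caller's side: `partition_pages` is a hypothetical helper, and it works on any iterator of `Result` values rather than on the crate's real page type.

```rust
/// Hypothetical helper: splits per-page results into the pages that parsed
/// and a list of (page index, error) pairs for the pages that did not.
fn partition_pages<T, E>(
    results: impl IntoIterator<Item = Result<T, E>>,
) -> (Vec<T>, Vec<(usize, E)>) {
    let mut parsed = Vec::new();
    let mut failed = Vec::new();
    for (index, result) in results.into_iter().enumerate() {
        match result {
            // Keep every page that parsed successfully.
            Ok(page) => parsed.push(page),
            // Remember which page failed and why, instead of panicking.
            Err(err) => failed.push((index, err)),
        }
    }
    (parsed, failed)
}
```

The failed list can then be logged or reported upstream, while text extraction continues on the remaining pages.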

2 participants