'papers extract' results in a call with nonsensical arguments to pdftotext #63

Open
andr-agus opened this issue Nov 14, 2023 · 2 comments

Comments

@andr-agus

Hi,

I've been using 'papers' for quite a while now and this is the first time I've seen this issue. I am trying to extract the bibliographic info of this article* from its PDF. The program throws this exception:

Command Line Error: Wrong page range given: the first page (2) can not be after the last page (1).
Traceback (most recent call last):
File "/usr/bin/papers", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/lib/python3.11/site-packages/papers/main.py", line 1091, in main
extractcmd(subp, o)
File "/usr/lib/python3.11/site-packages/papers/main.py", line 546, in extractcmd
print(extract_pdf_metadata(o.pdf, search_doi=not o.fulltext, search_fulltext=True, scholar=o.scholar, minwords=o.word_count, max_query_words=o.word_count, image=o.image))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 208, in extract_pdf_metadata
txt = pdfhead(pdf, maxpages, minwords, image=image)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 134, in pdfhead
txt += readpdf(pdf, first=i, last=i)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 41, in readpdf
sp.check_call(cmd)
File "/usr/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pdftotext', '-f', '2', '-l', '2', 'paper.pdf', '/tmp/tmpaq14gv_5.txt']' returned non-zero exit status 99.

Apparently, 'papers' is calling 'pdftotext' with arguments that make no sense. What is making 'papers' get confused about those arguments?
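
Reading the traceback, one possible explanation is that `pdfhead` reads one page at a time (`readpdf(pdf, first=i, last=i)`) until it has gathered `minwords` words. If page 1 yields too little text and pdftotext sees only one page in the file, the next call would request page 2 and fail as shown. The sketch below illustrates that loop with a guard on the page count. It is not the code in papers: `head_text`, `read_page`, and `page_count` are hypothetical names, and the default values are made up for illustration.

```python
from typing import Callable


def head_text(read_page: Callable[[int], str], page_count: int,
              maxpages: int = 10, minwords: int = 200) -> str:
    """Collect text page by page until at least `minwords` words are gathered.

    Hypothetical stand-in for the loop in `pdfhead`; `read_page(i)` plays the
    role of `readpdf(pdf, first=i, last=i)`.
    """
    txt = ""
    # Never request a page beyond what the PDF actually contains.
    last = min(maxpages, page_count)
    for i in range(1, last + 1):
        txt += read_page(i)
        if len(txt.split()) >= minwords:
            break
    return txt


# Fake one-page document whose only page has too few words.
pages = {1: "short title page"}
requested = []


def fake_read(i: int) -> str:
    requested.append(i)
    return pages[i]  # asking for page 2 would raise KeyError here


text = head_text(fake_read, page_count=len(pages), minwords=200)
```

With the guard, only page 1 is requested even though the word threshold is never reached. Without it, the loop would go on to ask for page 2 of a one-page file.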

(Have I mentioned how much I like this program? Cheers!)

*https://www.nature.com/articles/s41567-020-0990-x

@perrette
Owner

Hi @andr-agus,

May I ask which version of papers and pdftotext, and which operating system, you are using?
For me, the paper you linked works fine.

> pip install -U papers-cli
...
> papers --version
2.4
> pdftotext -h
pdftotext version 22.02.0
...
> papers extract s41567-020-0990-x.pdf
@article{Bong_2020,
	doi = {10.1038/s41567-020-0990-x},
	url = {https://doi.org/10.1038%2Fs41567-020-0990-x},
	year = 2020,
	month = {aug},
	publisher = {Springer Science and Business Media {LLC}},
	volume = {16},
	number = {12},
	pages = {1199--1205},
	author = {Kok-Wei Bong and An{\'{\i}}bal Utreras-Alarc{\'{o}}n and Farzad Ghafari and Yeong-Cherng Liang and Nora Tischler and Eric G. Cavalcanti and Geoff J. Pryde and Howard M. Wiseman},
	title = {A strong no-go theorem on the Wigner's friend paradox},
	journal = {Nature Physics}
}

Thanks for the good vibes.
Mahé

@perrette
Owner

PS:

> papers extract s41567-020-0990-x.pdf --debug
DEBUG:papers:read pdf page: 1
INFO:papers:pdftotext -f 1 -l 1 s41567-020-0990-x.pdf /tmp/tmp_fgh87__.txt
...
> pdftotext -f 1 -l 1 s41567-020-0990-x.pdf out1.txt  
... all fine ...
> pdftotext -f 2 -l 2 s41567-020-0990-x.pdf out2.txt
... all fine ... (this is the command from your log)

So I assume the issue is with your version of pdftotext. Is it too old, too new, or something else?
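
To narrow this down, it may help to check how many pages the local poppler tools see in the file and which pdftotext version is installed. The snippet below is a rough sketch, assuming `pdfinfo` (which ships with poppler alongside `pdftotext`) is on the PATH. The helper names `parse_page_count`, `page_count`, and `pdftotext_version` are made up for this example and are not part of papers.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the number from the 'Pages:' line of pdfinfo's output."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def page_count(pdf: str) -> int:
    """Ask pdfinfo how many pages it sees in `pdf`."""
    result = subprocess.run(["pdfinfo", pdf], check=True,
                            capture_output=True, text=True)
    return parse_page_count(result.stdout)


def pdftotext_version() -> str:
    """Return the version banner printed by `pdftotext -v`."""
    result = subprocess.run(["pdftotext", "-v"],
                            capture_output=True, text=True)
    # The banner usually goes to stderr; fall back to stdout just in case.
    return (result.stderr or result.stdout).strip()
```

If `page_count("paper.pdf")` returns 1 on your machine, that would match the "last page (1)" in the error message and would point to how the file is being read locally.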
