'papers extract' results in a call with nonsensical arguments to pdftotext #63

andr-agus · 2023-11-14T16:16:21Z

Hi,

I've been using 'papers' for quite a while now and this is the first time I've seen this issue. I am trying to extract the bibliographic info of this article* from its PDF. The program throws this exception:

Command Line Error: Wrong page range given: the first page (2) can not be after the last page (1).
Traceback (most recent call last):
File "/usr/bin/papers", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/lib/python3.11/site-packages/papers/main.py", line 1091, in main
extractcmd(subp, o)
File "/usr/lib/python3.11/site-packages/papers/main.py", line 546, in extractcmd
print(extract_pdf_metadata(o.pdf, search_doi=not o.fulltext, search_fulltext=True, scholar=o.scholar, minwords=o.word_count, max_query_words=o.word_count, image=o.image))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 208, in extract_pdf_metadata
txt = pdfhead(pdf, maxpages, minwords, image=image)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 134, in pdfhead
txt += readpdf(pdf, first=i, last=i)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/papers/extract.py", line 41, in readpdf
sp.check_call(cmd)
File "/usr/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pdftotext', '-f', '2', '-l', '2', 'paper.pdf', '/tmp/tmpaq14gv_5.txt']' returned non-zero exit status 99.

Apparently 'papers' is calling 'pdftotext' with arguments that make no sense. What is making 'papers' get confused about those arguments?

(Have I mentioned how much I like this program? Cheers!)

*https://www.nature.com/articles/s41567-020-0990-x

perrette · 2023-11-15T14:42:25Z

Hi @andr-agus,

May I ask which version of papers and pdftotext, and which operating system, you are using?
For me the paper you link works fine.

> pip install -U papers-cli
...
> papers --version
2.4
> pdftotext -h
pdftotext version 22.02.0
...
> papers extract s41567-020-0990-x.pdf
@article{Bong_2020,
	doi = {10.1038/s41567-020-0990-x},
	url = {https://doi.org/10.1038%2Fs41567-020-0990-x},
	year = 2020,
	month = {aug},
	publisher = {Springer Science and Business Media {LLC}},
	volume = {16},
	number = {12},
	pages = {1199--1205},
	author = {Kok-Wei Bong and An{\'{\i}}bal Utreras-Alarc{\'{o}}n and Farzad Ghafari and Yeong-Cherng Liang and Nora Tischler and Eric G. Cavalcanti and Geoff J. Pryde and Howard M. Wiseman},
	title = {A strong no-go theorem on the Wigner's friend paradox},
	journal = {Nature Physics}
}

Thanks for the good vibes.
Mahé

perrette · 2023-11-15T14:47:19Z

PS:

> papers extract s41567-020-0990-x.pdf --debug
DEBUG:papers:read pdf page: 1
INFO:papers:pdftotext -f 1 -l 1 s41567-020-0990-x.pdf /tmp/tmp_fgh87__.txt
...
> pdftotext -f 1 -l 1 s41567-020-0990-x.pdf out1.txt  
... all fine ...
> pdftotext -f 2 -l 2 s41567-020-0990-x.pdf out2.txt
... all fine ... (this is the command from your log)

So I assume the issue is with your version of pdftotext. Could it be too old or too new?
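
To narrow this down, it may help to compare the page count that Poppler reports for your file with the page being requested. The message in your log gives the last page as 1 even though -l 2 was passed, which would be consistent with pdftotext seeing a single page in your copy of the PDF, though that is only a guess on my part. Below is a minimal sketch, not code from papers: it assumes the pdfinfo tool from Poppler is installed, and the helper names (parse_page_count, pdf_page_count, page_in_range) are made up for illustration.

```python
import re
import subprocess as sp


def parse_page_count(pdfinfo_output):
    """Return the integer on the 'Pages:' line of pdfinfo output, or None."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def pdf_page_count(pdf):
    """Ask pdfinfo (assumed to be on PATH) how many pages it sees in `pdf`."""
    output = sp.check_output(["pdfinfo", pdf], text=True)
    return parse_page_count(output)


def page_in_range(page, num_pages):
    """True if `page` is a valid 1-based page number for a document of `num_pages`."""
    return num_pages is not None and 1 <= page <= num_pages
```

If pdf_page_count("paper.pdf") returns 1 on your machine, then page_in_range(2, 1) is False, and a request for page 2 would be expected to fail regardless of the pdftotext version.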

perrette added the not reproducible label Nov 15, 2023
