This is a project designed for researchers to conveniently access papers they need.
A command line tool, paper-downloader.py, is included. Given the name of a paper, it automatically searches for the paper on the Internet and downloads it.
The downloaded paper thus gets a readable file name
(I originally wrote the tool because I was tired of file names that were random strings).
It mainly supports searching papers in computer science.
To run the command line tool, you'll need the following installed:
- requests (pip install --user requests)
- BeautifulSoup4 (pip install --user beautifulsoup4)
- termcolor (pip install --user termcolor)
- pdftk command line executable
- poppler-utils (optional)
Usage:
$ ./paper-downloader.py --help
$ ./paper-downloader.py "Distinctive image features from scale-invariant keypoints"
$ ./paper-downloader.py "http://arxiv.org/abs/1506.03184"
NOTE: If you are not on a school network, you may need to configure a proxy through the http_proxy and https_proxy environment variables
to be able to download from certain sites (such as 'dl.acm.org').
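As a rough illustration of the note above, the snippet below collects the two proxy variables into the scheme-keyed dictionary that requests accepts as its proxies argument. The helper name proxy_settings is hypothetical and not part of this project. By default requests also reads these variables from the environment on its own, so the sketch is mainly useful for checking what would be picked up.

```python
import os


def proxy_settings(environ=None):
    """Return {'http': ..., 'https': ...} for the proxy variables that are set.

    Hypothetical helper for illustration; not part of paper-downloader.py.
    """
    environ = os.environ if environ is None else environ
    proxies = {}
    for scheme in ("http", "https"):
        value = environ.get(scheme + "_proxy")
        if value:
            proxies[scheme] = value
    return proxies


if __name__ == "__main__":
    # Prints an empty dict when neither variable is configured.
    print(proxy_settings())
```

If the printed dictionary is empty, neither variable is set, and downloads from sites that require a campus address may fail.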
The searcher module fuzzy-searches and analyses results from:
- Google Scholar
The fetcher module further analyses the results and downloads papers from the following sources:
- direct pdf link
- dl.acm.org
- ieeexplore.ieee.org
- arxiv.org
Searcher and Fetcher are extensible to support more resources.
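One way such extensibility could look is a small registry that maps a host name to a fetch function. This is only a sketch under assumptions: the names FETCHERS, register_fetcher and pick_fetcher are hypothetical, and the project's real Fetcher interface is not described in this README.

```python
from urllib.parse import urlparse

# Hypothetical registry: host name -> fetch function. The project's actual
# Fetcher classes and their interface are not shown in this README.
FETCHERS = {}


def register_fetcher(host):
    """Decorator that registers a fetch function for one host."""
    def decorator(func):
        FETCHERS[host] = func
        return func
    return decorator


@register_fetcher("arxiv.org")
def fetch_arxiv(url):
    # Placeholder body: a real fetcher would download the PDF.
    return "arxiv fetcher would handle " + url


@register_fetcher("dl.acm.org")
def fetch_acm(url):
    return "acm fetcher would handle " + url


def direct_pdf(url):
    return "direct pdf fetcher would handle " + url


def pick_fetcher(url):
    """Choose a fetcher by host, falling back to direct PDF links."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if host in FETCHERS:
        return FETCHERS[host]
    if parsed.path.lower().endswith(".pdf"):
        return direct_pdf
    return None  # no fetcher knows this source
```

With this layout, supporting a new site would only need one more decorated function, and the dispatch code stays unchanged.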
The command line tool will directly download the paper with a clean filename.
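A clean filename could be derived from the paper title roughly as follows. This is a minimal sketch, not the tool's actual code: the function name clean_filename, the set of characters replaced, and the length limit are all assumptions.

```python
import re


def clean_filename(title, max_length=200):
    """Turn a paper title into a readable, filesystem-safe PDF name.

    Hypothetical helper; the character set and length limit are assumptions.
    """
    # Replace characters that are unsafe on common filesystems.
    name = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse runs of whitespace and trim the ends.
    name = re.sub(r"\s+", " ", name).strip()
    # Keep the name reasonably short; fall back if nothing is left.
    name = name[:max_length].rstrip() or "paper"
    return name + ".pdf"


if __name__ == "__main__":
    print(clean_filename("Distinctive image features from scale-invariant keypoints"))
```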
All downloaded papers are compressed with ps2pdf if it is available
(ps2pdf is normally provided by Ghostscript rather than poppler-utils).
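The "if available" behaviour can be sketched as a check for the executable before calling it. The sketch assumes the plain "ps2pdf input output" invocation. The options the tool actually passes are not stated in this README, and the function name compress_pdf is hypothetical.

```python
import shutil
import subprocess


def compress_pdf(src, dst, which=shutil.which, run=subprocess.check_call):
    """Compress src into dst with ps2pdf; return False if ps2pdf is missing.

    The which and run parameters exist only so the lookup and the process
    call can be replaced in tests.
    """
    exe = which("ps2pdf")
    if exe is None:
        return False  # tool not installed: leave the file untouched
    run([exe, src, dst])
    return True
```

Returning False instead of raising keeps the download usable on machines without ps2pdf, which matches the optional nature of the step.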
The server provides:
- RESTful APIs on papers
- Interactive paper reading UI supported by pdf2htmlEX
The command line tool is sufficient on its own. If you'd like to play with the server, you'll need:
- Python2 with virtualenv. Python headers are needed (python-dev on debian/ubuntu).
- ghostscript
- libcurl (libcurl4-{openssl,nss,gnutls}-dev on debian/ubuntu)
- xapian (libxapian-dev & python2-xapian on debian/ubuntu)
- pdf2htmlEX installed. See its download guide
- poppler-utils, which provides the 'pdftotext' command line util
Note: if you need to run the server on debian/ubuntu, make sure you do not have the 'python2-bson' package installed.
- Fetcher dedup: when both the arxiv abs and pdf links appear in the search results, the page is downloaded twice (maybe add a cache for requests)
- Don't trust arxiv links from Google Scholar
- Is the title correctly updated for dl.acm.org?
- Extract title from bibtex -- more accurate?
- Fetcher for other sites
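For the fetcher dedup item above, one possible approach (an alternative to the request cache it suggests, not something the project implements) is to normalise arxiv abs and pdf links to the same paper identifier before fetching. The names arxiv_id and dedup_results are hypothetical, and the pattern only covers new-style identifiers such as 1506.03184.

```python
import re

# Matches new-style arXiv identifiers such as 1506.03184, optionally followed
# by a version suffix (v2) and a .pdf extension. Old-style identifiers
# (for example cs/0112017) are deliberately not handled in this sketch.
ARXIV_RE = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?(?:\.pdf)?$",
    re.IGNORECASE,
)


def arxiv_id(url):
    """Return the arXiv identifier in url, or None for other URLs."""
    match = ARXIV_RE.search(url)
    return match.group(1) if match else None


def dedup_results(urls):
    """Drop later URLs that point at an already-seen arXiv paper or URL."""
    seen = set()
    unique = []
    for url in urls:
        key = arxiv_id(url) or url
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The first link for each paper is kept, so the order of the search results is preserved.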