ppwwyyxx/SoPaper

SoPaper, So Easy

This is a project designed for researchers to conveniently access papers they need.

A command line tool, paper-downloader.py, is included. Given the name of a paper, it automatically searches for the paper and downloads it from the Internet. The downloaded file gets a readable name (I wrote the tool in the first place because I was tired of seeing random strings as file names). It mainly supports searching for papers in computer science.

How to Use

To run the command line tool, you'll need the following installed:

  • requests (pip install --user requests)
  • BeautifulSoup4 (pip install --user beautifulsoup4)
  • termcolor (pip install --user termcolor)
  • pdftk command line executable.
  • poppler-utils (optional)

Usage:

$ ./paper-downloader.py --help
$ ./paper-downloader.py "Distinctive image features from scale-invariant keypoints"
$ ./paper-downloader.py "http://arxiv.org/abs/1506.03184"

NOTE: If you are not on a school network, you may need to set a proxy through the environment variables http_proxy and https_proxy in order to download from certain sites (such as 'dl.acm.org').
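
Before running the tool, it can help to check which proxy settings Python actually sees. The sketch below uses only the standard library's urllib.request.getproxies(). The proxy address is a made-up placeholder, and it is an assumption that the tool picks up the settings in the same way (the requests library does consult these environment variables by default).

```python
import os
import urllib.request

# Placeholder proxy address, for illustration only; substitute your own.
os.environ["http_proxy"] = "http://127.0.0.1:8080"
os.environ["https_proxy"] = "http://127.0.0.1:8080"

# getproxies() returns a dict mapping URL scheme -> proxy URL,
# built from the *_proxy environment variables.
proxies = urllib.request.getproxies()
print(proxies.get("http"), proxies.get("https"))
```

In practice you would export the two variables in your shell before invoking paper-downloader.py rather than setting them from Python.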

Features

The searcher module performs a fuzzy search and analyses the results from

  • Google Scholar
  • Google

and the fetcher module will further analyse the results and download papers from the supported sources, which include arXiv and dl.acm.org.

Searcher and Fetcher are extensible to support more resources.

The command line tool downloads the paper directly and saves it with a clean filename. Every downloaded paper is compressed using ps2pdf (which ships with Ghostscript), if available.
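
To illustrate what a "clean filename" could look like, here is a minimal sketch that derives a file name from a paper title. The function name title_to_filename and the exact set of replaced characters are assumptions for illustration, not the tool's actual implementation.

```python
import re


def title_to_filename(title: str, ext: str = ".pdf") -> str:
    """Turn a paper title into a readable, filesystem-safe file name (hypothetical helper)."""
    # Replace characters that are unsafe in file names on common platforms.
    name = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse the runs of whitespace left behind by the replacement.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


print(title_to_filename("Distinctive image features from scale-invariant keypoints"))
```

The title itself stays recognisable, which is the point: the file can be found again by name instead of by a random string.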

The server provides:

  • RESTful APIs on papers
  • Interactive paper reading UI supported by pdf2htmlEX

The command line tool is sufficient on its own. If you'd like to play with the server, you'll need:

  • Python2 with virtualenv. Python headers are needed (python-dev on debian/ubuntu).
  • ghostscript
  • libcurl (libcurl4-{openssl,nss,gnutls}-dev on debian/ubuntu)
  • xapian (libxapian-dev & python2-xapian on debian/ubuntu)
  • pdf2htmlEX. See its download guide
  • poppler-utils, which provides the 'pdftotext' command line utility

Note: if you need to run the server on debian/ubuntu, make sure you do not have the 'python2-bson' package installed.

TODO

  • Fetcher dedup: when both the arxiv abs and pdf links appear in the search results, the page is downloaded twice (maybe add a cache for requests)
  • Don't trust arxiv link from google scholar
  • Is title correctly updated for dlacm?
  • Extract title from bibtex -- more accurate?
  • Fetcher for other sites
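
One possible approach to the fetcher dedup item above is to map each URL to a canonical key before fetching. This is only a sketch: canonical_key and dedup are hypothetical names, the pattern covers only new-style arXiv identifiers (such as 1506.03184), and treating different versions of a paper as the same entry is an assumption.

```python
import re

# Matches new-style arXiv identifiers in abs/ or pdf/ links,
# ignoring an optional version suffix such as "v2".
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def canonical_key(url: str) -> str:
    """Return a key that is identical for the abs and pdf links of one arXiv paper."""
    m = ARXIV_ID.search(url)
    return "arxiv:" + m.group(1) if m else url


def dedup(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


results = [
    "http://arxiv.org/abs/1506.03184",
    "http://arxiv.org/pdf/1506.03184v2.pdf",
    "http://example.com/paper.pdf",
]
print(dedup(results))
```

A request cache, as the TODO item suggests, would solve the same problem at a lower level; the key-based approach has the advantage of not needing any storage beyond the current search.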