Provides simple extraction and analysis of the most commonly used words in PDF or TXT files, using the Python Natural Language Toolkit (NLTK).
If you have more complex text extraction needs, you may want to take a look at the Doc Processing Toolkit.
First, download the repo:

```
git clone https://github.com/18F/text-analysis.git
```
We recommend using pipenv to install dependencies and run everything safely inside a virtualenv. You'll set that up by running

```
pipenv install
```

from within the repo.
Your virtualenv should be using Python 3.x. If it's not, try `brew install python` to get it sorted out. Remember: after you have Python 3.x installed, you'll need to re-run `pipenv install`. If you don't have pipenv, you should be able to install it by running `brew install pipenv`.
Check the Pipenv documentation for details.
First, drop the files you want to analyze into the `files` directory.
Then activate your virtual environment:

```
pipenv shell
```
If this is your first time running this, or if you haven't used it in a long time, make sure the NLTK modules are up to date by running

```
python update_nltk.py
```
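Under the hood, updating NLTK modules comes down to calling NLTK's downloader. The exact packages `update_nltk.py` fetches aren't spelled out here, so the following is just a minimal sketch assuming the tokenizer models and stopword lists are what's needed:

```python
# Minimal sketch of an NLTK data update. The package names below are
# assumptions, not necessarily what update_nltk.py actually downloads.
import nltk

for package in ("punkt", "stopwords"):
    nltk.download(package)  # fetches (or refreshes) the named data package
```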
Then run

```
python keyword_analysis.py
```
The dependencies should all be installed for you when you run `pipenv install`, but if you're curious about what's happening under the hood: PyPDF2 is used to read PDF files, and NLTK handles the textual analysis.
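To make that division of labor concrete, here is a minimal, hypothetical sketch of how the two libraries fit together. It is not the repo's `keyword_analysis.py`; the function name and file path are made up for illustration:

```python
from PyPDF2 import PdfReader
import nltk
from nltk.corpus import stopwords

def most_common_words(pdf_path, n=10):
    """Hypothetical helper: count the most frequent words in one PDF."""
    # PyPDF2 pulls the raw text out of each page.
    reader = PdfReader(pdf_path)
    text = " ".join(page.extract_text() or "" for page in reader.pages)

    # NLTK tokenizes the text; keep alphabetic tokens and drop stopwords.
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]

    # FreqDist ranks tokens by how often they occur.
    return nltk.FreqDist(words).most_common(n)

print(most_common_words("files/example.pdf"))  # hypothetical input file
```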