# Simple Keyword Analysis

Provides a simple extraction and analysis of the most commonly used words in PDF or TXT files, using the Python Natural Language Toolkit (NLTK).

If you have more complex text extraction needs, you may want to take a look at the Doc Processing Toolkit.
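To give a sense of the core technique, here's a minimal sketch of word-frequency analysis with NLTK. The file name and exact steps are illustrative, not the repo's actual code:

```python
# Illustrative sketch: count the most common words in a text file with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open("files/example.txt") as f:  # hypothetical input file
    text = f.read()

# Keep alphabetic tokens, lowercased, with English stop words removed.
words = [w.lower() for w in word_tokenize(text) if w.isalpha()]
words = [w for w in words if w not in stopwords.words("english")]

# Print the ten most frequent remaining words.
for word, count in nltk.FreqDist(words).most_common(10):
    print(word, count)
```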

## Installation

First, download the repo: `git clone https://github.com/18F/text-analysis.git`

We recommend using pipenv to install dependencies and run things safely in a virtualenv. You'll set that up by running `pipenv install` from within the repo.

Your virtualenv should be using Python 3.x. If it isn't, `brew install python` should get you a current version. Remember: after you have Python 3.x installed, you'll need to re-run `pipenv install`.

If you don't have pipenv, you should be able to install it by running `brew install pipenv`. Check the Pipenv documentation for details.

## Usage

First, drop the files you want to analyze into the `files` directory.

Then activate your virtual environment: `pipenv shell`

If this is your first time running this, or if you haven't used it in a long time, make sure the NLTK modules are up to date by running `python update_nltk.py`.
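For reference, such an update step usually amounts to downloading NLTK's data packages. A rough sketch, assuming the standard `punkt` and `stopwords` packages; the repo's actual `update_nltk.py` may differ:

```python
# Rough sketch of an NLTK update step; the actual update_nltk.py may differ.
import nltk

# Fetch or refresh the data packages the analysis relies on
# (tokenizer models and the stop-word lists).
for package in ("punkt", "stopwords"):
    nltk.download(package)
```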

Then run `python keyword_analysis.py`.

## Dependencies

These are all installed for you when you run `pipenv install`, but if you're curious about what's happening under the hood:

PyPDF2 is used to read PDF files. NLTK handles the textual analysis.
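As an illustration of how the two fit together, here's a hedged sketch of PDF text extraction with PyPDF2 feeding the NLTK analysis. Function and file names are assumptions, not the repo's actual code:

```python
# Illustrative only: extracting PDF text with PyPDF2 so NLTK can analyze it.
import PyPDF2

def pdf_to_text(path):
    """Concatenate the text of every page in a PDF (PyPDF2 >= 2.x API)."""
    with open(path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

text = pdf_to_text("files/example.pdf")  # hypothetical input file
# From here, `text` can go through the same NLTK tokenize/FreqDist steps
# sketched above.
```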