Collection of functions used in my dissertation, A Gospel of Health and Salvation.
Available sections of the module are:
- clean -- code for cleaning messy OCR
- models -- code creating topic modeling pipeline
- phrases -- collection of most common noun phrases in corpus
- preprocess -- prepare text for modeling with Mallet
- reports -- code for taking the data about the corpus and isolating particular elements
- utilities -- helper functions for executing the above tasks
To generate error rate statistics:
from text2topics import reports
reports.process_directory(directory, spelling_dictionary)
To create a spelling dictionary from text files:
from text2topics import utilities
utilities.create_spelling_dictionary(directory, wordlists)
wordlists is a list of file(s) containing the verified words and directory is the directory where those wordlist files reside. This function converts all words to lowercase and returns only the list of unique entries.
To install, navigate to the root directory of module (text2topics/) and run
pip install .
To update, run
pip install --upgrade .