This is a heavily overhauled version of this library (GitHub repo). New features include:
- configuration files
- ability to set topic model hyperparameters
- major computational speedups
- additional assessment metrics for choosing an appropriate number of topics
- normalized topic loadings
- new charts
- interactive charts
- access to raw data
- a topic loading similarity browser app
TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license. Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus, though you may want to perform additional preprocessing steps on the corpus before topic modeling. It also offers a common interface for two topic models (LDA using either variational inference or Gibbs sampling, and NMF using alternating least squares with a projected gradient method), and implements five state-of-the-art methods for estimating the optimal number of topics to model a corpus. TOM constructs an interactive web browser-based application that makes it easy to explore a topic model and the related corpus.
- Feature selection based on word frequency
  - Via the parameters `max_relative_frequency`, `min_absolute_frequency`, and `max_features`
- Weighting
  - tf-idf (for NMF)
  - tf (for LDA)
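The frequency filters work like the analogous scikit-learn vectorizer options (`max_df`, `min_df`, `max_features`). The sketch below illustrates the idea in plain Python; it is not TOM's actual implementation, and the function name is hypothetical:

```python
from collections import Counter

def filter_vocabulary(docs, max_relative_frequency=0.8,
                      min_absolute_frequency=2, max_features=None):
    """Keep words whose document frequency is neither too high
    (relative cap) nor too low (absolute floor)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    kept = [word for word, df in doc_freq.items()
            if df / n_docs <= max_relative_frequency
            and df >= min_absolute_frequency]
    kept.sort(key=lambda word: (-doc_freq[word], word))  # most frequent first
    return kept[:max_features] if max_features is not None else kept

docs = [
    "topic models find latent topics",
    "latent topics summarize documents",
    "documents about topic models",
]
print(filter_vocabulary(docs))
# → ['documents', 'latent', 'models', 'topic', 'topics']
```

With `max_features` set, only the most frequent surviving words are kept, which bounds the vocabulary (and therefore the document-term matrix) at a fixed size.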
- Latent Dirichlet Allocation
  - Standard variational Bayesian inference (Latent Dirichlet Allocation. Blei et al., 2003)
  - Online variational Bayesian inference (Online learning for Latent Dirichlet Allocation. Hoffman et al., 2010)
  - Collapsed Gibbs sampling (Finding scientific topics. Griffiths & Steyvers, 2004)
- Non-negative Matrix Factorization (NMF)
  - Alternating least squares with a projected gradient method (Projected gradient methods for non-negative matrix factorization. Lin, 2007)
- Stability analysis (How Many Topics? Stability Analysis for Topic Models. Greene et al., 2014)
- Spectral analysis (On finding the natural number of topics with Latent Dirichlet Allocation: Some observations. Arun et al., 2010)
- Consensus-based analysis (Metagenes and molecular pattern discovery using matrix factorization. Brunet et al., 2004)
- Word2Vec-based coherence metric
- Perplexity (LDA only; as computed by scikit-learn)
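To give a feel for what one of these metrics computes, here is an independent sketch (not TOM's implementation) of the spectral metric of Arun et al. (2010): the symmetric KL divergence between the singular values of the topic-word matrix and the length-weighted topic proportions, where lower values suggest a better-fitting number of topics. The function name and inputs are illustrative:

```python
import numpy as np

def arun_symmetric_kl(topic_word, doc_topic, doc_lengths, eps=1e-12):
    """Symmetric KL divergence between the singular values of the
    topic-word matrix and the length-weighted topic proportions
    (after Arun et al., 2010)."""
    cm1 = np.linalg.svd(topic_word, compute_uv=False)       # one singular value per topic
    cm2 = np.asarray(doc_lengths) @ np.asarray(doc_topic)   # topic mass over the corpus
    cm1 = np.sort(cm1)[::-1] / cm1.sum()                    # normalize to distributions
    cm2 = np.sort(cm2)[::-1] / cm2.sum()
    # sum (p - q) * log(p / q) == KL(p || q) + KL(q || p)
    return float(np.sum((cm1 - cm2) * np.log((cm1 + eps) / (cm2 + eps))))

# Illustrative random inputs: 5 topics, 50 words, 20 documents
rng = np.random.default_rng(0)
topic_word = rng.random((5, 50))
doc_topic = rng.random((20, 5))
doc_topic /= doc_topic.sum(axis=1, keepdims=True)
doc_lengths = rng.integers(50, 200, size=20)
print(arun_symmetric_kl(topic_word, doc_topic, doc_lengths))
```

In practice one computes this score for each candidate number of topics and looks for the minimum, which is what the plotting helpers below automate.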
- Clone this repo: `git clone [email protected]:warmlogic/TOM.git`
- In a terminal, `cd` to the `TOM` directory
- Run the following command to install Miniconda (Python 3) and the required libraries (installed in the `base` conda environment): `./python_env_setup.sh`
- Log out and back in to ensure the `base` conda environment is active
List of the installed libraries:
- Plotly and Dash
- gensim
- lda
- matplotlib
- networkx
- nltk
- numpy
- openpyxl
- pandas
- scikit-learn
- scipy
- seaborn
- smart_open
The provided scripts use the parameters defined in the configuration file. Copy the `config_template.ini` file to `config.ini` and set it up as desired.
In order of importance, the scripts are:

- `assess_topics.py`: Produce artifacts used for estimating the optimal number of topics
- `build_topic_model_browser.py`: Run a local web server and generate a web browser-based application for exploring the topic model and corpus
- `infer_topics.py`: Simply train and save topic models for a range of numbers of topics
Run a script in the terminal using the following command structure:

```shell
python <script name> --config_filepath=<config file name>
```
A corpus is a TSV (tab separated values) file describing documents. This is formatted as one document per line, with the following columns:
- `id`: a unique identifier
- `affiliation`: for grouping documents within a dataset
- `dataset`: used when combining documents from various sources
- `title`
- `author`
- `date`: preferably formatted as `YYYY-MM-DD`
- `text`: the text on which to train the topic model, which may be preprocessed in various ways
- `orig_text`: the original text of the document (optional; if absent, the `text` column will be used)
```
id	affiliation	dataset	title	author	date	text	orig_text
doc1	journal1	dataset1	Document 1's title	Author 1	2019-01-01	full content document 1	Full content of document 1.
doc2	journal2	dataset1	Document 2's title	Author 2	2019-05-01	full content document 2	Full content of document 2.
etc.
```
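Before training, it can be useful to sanity-check that a corpus file parses into the expected columns. A minimal check using only the Python standard library (not part of TOM's API):

```python
import csv
import io

# In practice, replace the string below with an open file, e.g. open('input/raw_corpus.tsv')
tsv = (
    "id\taffiliation\tdataset\ttitle\tauthor\tdate\ttext\torig_text\n"
    "doc1\tjournal1\tdataset1\tDocument 1's title\tAuthor 1\t2019-01-01\t"
    "full content document 1\tFull content of document 1.\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(sorted(rows[0].keys()))
# → ['affiliation', 'author', 'dataset', 'date', 'id', 'orig_text', 'text', 'title']
```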
The following code snippets are run by the provided scripts; they are shown here to demonstrate how to interact with the library directly. You'll need to import the required classes as follows:
```python
from tom_lib.structure.corpus import Corpus
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization, LatentDirichletAllocation
from tom_lib.visualization.visualization import Visualization
```
The following code snippet shows how to load a corpus and vectorize its documents using tf-idf with unigrams.
```python
corpus = Corpus(
    source_filepath='input/raw_corpus.tsv',
    vectorization='tfidf',
    n_gram=1,
    max_relative_frequency=0.8,
    min_absolute_frequency=4,
)
print(f'Corpus size: {corpus.size:,}')
print(f'Vocabulary size: {corpus.vocabulary_size:,}')
print('Vector representation of document 0:\n', corpus.word_vector_for_document(doc_id=0))
```
It is possible to instantiate an NMF or LDA object and then infer topics.

NMF (use `vectorization='tfidf'` when creating the corpus):
```python
topic_model = NonNegativeMatrixFactorization(corpus)
topic_model.infer_topics(num_topics=15)
```
LDA (using either the standard variational Bayesian inference or Gibbs sampling; use `vectorization='tf'` when creating the corpus):
```python
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='variational')

topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='gibbs')
```
Here we instantiate an NMF object, then generate plots with four metrics for estimating the optimal number of topics. Estimating the optimal number of topics may take a long time with a large corpus.
```python
topic_model = NonNegativeMatrixFactorization(corpus)
viz = Visualization(topic_model)

viz.plot_greene_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    tao=10,
    top_n_words=10,
)

viz.plot_arun_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_brunet_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_coherence_w2v_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    top_n_words=10,
)

# # LDA only
# viz.plot_perplexity_metric(
#     min_num_topics=5,
#     max_num_topics=50,
#     step=1,
# )
```
To allow reusing previously learned topic models, TOM can save them to disk, as shown below.
```python
import tom_lib.utils as ut

ut.save_topic_model(topic_model, 'output/NMF_15topics.pickle')
topic_model = ut.load_topic_model('output/NMF_15topics.pickle')
```
This code excerpt illustrates how one can manipulate a topic model, e.g. get the topic distribution for a document or the word distribution for a topic.
```python
print('\nTopics:')
topic_model.print_topics(num_words=10)
print('\nTopic distribution for document 0:',
      topic_model.topic_distribution_for_document(0))
print('\nMost likely topic for document 0:',
      topic_model.most_likely_topic_for_document(0))
print('\nFrequency of topics:',
      topic_model.topics_frequency())
print('\nTop 10 most relevant words for topic 2:',
      topic_model.top_words(2, 10))
```
- Create a compute instance using a service like AWS EC2 or GCP
  - Handy repo for easily provisioning an AWS EC2 instance
  - The instance size you choose depends on the size of your corpus and topic model parameters. Training a model will take significantly more RAM and CPU than simply hosting the app with a pre-trained model. It's possible to train the model on your own computer and `rsync` the files to the instance for running the app.
    - RAM suggestion for training a new model: 8 GB (or more)
    - RAM suggestion for running the app with a pre-trained model: 2 GB
- If using AWS, edit the Security Group inbound rules and create a custom TCP rule to allow inbound traffic on port 80.
  - This is the port that someone will use to access the web app from the outside world.
- SSH into the instance
- Install a web server: `sudo apt-get install nginx`
- Edit the file `/etc/nginx/sites-enabled/default` to contain only the following text.
  - If you are using a custom port for accessing the web app, change the `listen` port to that value
  - If you changed the port in `config.ini` (default is 5000), update the `proxy_pass` port to be that value. This is the port on which the Flask app will run.

```
server {
    listen 80;
    location / {
        include proxy_params;
        proxy_pass http://127.0.0.1:5000;
    }
}
```
- Restart the web server: `sudo service nginx restart`
NB: If you want to run more than one web app at a time, you can simply add additional copies of the above `server` block with different ports, and add corresponding Security Group inbound rules as described above.
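For instance, a hypothetical second app configured to run on Flask port 5001 could be exposed on port 8080 with an extra block like this (both port numbers are illustrative):

```
server {
    listen 8080;
    location / {
        include proxy_params;
        proxy_pass http://127.0.0.1:5001;
    }
}
```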
- Start a multiplexer session: `tmux new -s webserver`
- Run the web app: `python build_topic_model_browser.py --config_filepath=config.ini`
- Visit the instance's public IP address to verify it works. Watch for corresponding logs in the terminal.
- Detach from the multiplexer session: `ctrl-b d`
- To reattach to the session: `tmux attach -t webserver`
- To shut down the app: `ctrl-c`
- Remember that you'll be charged for the instance as long as it is running. You can use the AWS EC2 console to stop the instance so you stop being charged, and restart it later to pick up where you left off.