This is a heavily overhauled version of this library (GitHub repo). New features include:
- configuration files
- ability to set topic model hyperparameters
- major computational speedups
- additional assessment metrics for choosing an appropriate number of topics
- normalized topic loadings
- new charts
- interactive charts
- access to raw data
- a topic loading similarity browser app
TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license. Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus, though you may want to perform additional preprocessing steps on the corpus before topic modeling. It also offers a common interface for two topic models (LDA using either variational inference or Gibbs sampling, and NMF using alternating least squares with a projected gradient method), and implements five state-of-the-art methods for estimating the optimal number of topics to model a corpus. TOM constructs an interactive web browser-based application that makes it easy to explore a topic model and the related corpus.
- Feature selection based on word frequency
  - Via the parameters `max_relative_frequency`, `min_absolute_frequency`, and `max_features`
- Weighting
  - tf-idf (for NMF)
  - tf (for LDA)
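The frequency filters work like the analogous scikit-learn vectorizer options (`max_df`, `min_df`, `max_features`). The sketch below illustrates the idea in plain Python; it is not TOM's actual implementation, and the function name is hypothetical:

```python
from collections import Counter

def filter_vocabulary(docs, max_relative_frequency=0.8,
                      min_absolute_frequency=2, max_features=None):
    """Keep words whose document frequency is neither too high
    (relative cap) nor too low (absolute floor)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))
    kept = [word for word, df in doc_freq.items()
            if df / n_docs <= max_relative_frequency
            and df >= min_absolute_frequency]
    kept.sort(key=lambda word: (-doc_freq[word], word))  # most frequent first
    return kept[:max_features] if max_features is not None else kept

docs = [
    "topic models find latent topics",
    "latent topics summarize documents",
    "documents about topic models",
]
print(filter_vocabulary(docs))
# → ['documents', 'latent', 'models', 'topic', 'topics']
```

With `max_features` set, only the most frequent surviving words are kept, which bounds the vocabulary (and therefore the document-term matrix) at a fixed size.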
- Latent Dirichlet Allocation
  - Standard variational Bayesian inference (Latent Dirichlet Allocation. Blei et al., 2003)
  - Online variational Bayesian inference (Online learning for Latent Dirichlet Allocation. Hoffman et al., 2010)
  - Collapsed Gibbs sampling (Finding scientific topics. Griffiths & Steyvers, 2004)
- Non-negative Matrix Factorization (NMF)
  - Alternating least squares with a projected gradient method (Projected gradient methods for non-negative matrix factorization. Lin, 2007)
- Stability analysis (How Many Topics? Stability Analysis for Topic Models. Greene et al., 2014)
- Spectral analysis (On finding the natural number of topics with Latent Dirichlet Allocation: Some observations. Arun et al., 2010)
- Consensus-based analysis (Metagenes and molecular pattern discovery using matrix factorization. Brunet et al., 2004)
- Word2Vec-based coherence metric
- Perplexity (LDA only; as computed by scikit-learn)
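To give a feel for what one of these metrics computes, here is an independent sketch (not TOM's implementation) of the spectral metric of Arun et al. (2010): the symmetric KL divergence between the singular values of the topic-word matrix and the length-weighted topic proportions, where lower values suggest a better-fitting number of topics. The function name and inputs are illustrative:

```python
import numpy as np

def arun_symmetric_kl(topic_word, doc_topic, doc_lengths, eps=1e-12):
    """Symmetric KL divergence between the singular values of the
    topic-word matrix and the length-weighted topic proportions
    (after Arun et al., 2010)."""
    cm1 = np.linalg.svd(topic_word, compute_uv=False)       # one singular value per topic
    cm2 = np.asarray(doc_lengths) @ np.asarray(doc_topic)   # topic mass over the corpus
    cm1 = np.sort(cm1)[::-1] / cm1.sum()                    # normalize to distributions
    cm2 = np.sort(cm2)[::-1] / cm2.sum()
    # sum (p - q) * log(p / q) == KL(p || q) + KL(q || p)
    return float(np.sum((cm1 - cm2) * np.log((cm1 + eps) / (cm2 + eps))))

# Illustrative random inputs: 5 topics, 50 words, 20 documents
rng = np.random.default_rng(0)
topic_word = rng.random((5, 50))
doc_topic = rng.random((20, 5))
doc_topic /= doc_topic.sum(axis=1, keepdims=True)
doc_lengths = rng.integers(50, 200, size=20)
print(arun_symmetric_kl(topic_word, doc_topic, doc_lengths))
```

In practice one computes this score for each candidate number of topics and looks for the minimum, which is what the plotting helpers below automate.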
- Clone this repo: `git clone [email protected]:warmlogic/TOM.git`
- In a terminal, `cd` to the `TOM` directory
- Run the following command to install Miniconda (Python 3) and the required libraries (installed in the `base` conda environment): `./python_env_setup.sh`
- Log out and back in to ensure the `base` conda environment is active
List of the installed libraries:
- Plotly and Dash
- gensim
- lda
- matplotlib
- networkx
- nltk
- numpy
- openpyxl
- pandas
- scikit-learn
- scipy
- seaborn
- smart_open
The provided scripts use the parameters defined in the configuration file. Copy the `config_template.ini` file to `config.ini` and set it up as desired.
In order of importance, the scripts are:

- `assess_topics.py`: Produce artifacts used for estimating the optimal number of topics
- `build_topic_model_browser.py`: Run a local web server and generate a web browser-based application for exploring the topic model and corpus
- `infer_topics.py`: Simply train and save topic models for a range of numbers of topics
Run a script in the terminal using the following command structure:

```shell
python <script name> --config_filepath=<config file name>
```
A corpus is a TSV (tab separated values) file describing documents. This is formatted as one document per line, with the following columns:
- `id`: a unique identifier
- `affiliation`: for grouping documents within a dataset
- `dataset`: used when combining documents from various sources
- `title`
- `author`
- `date`: preferably formatted as `YYYY-MM-DD`
- `text`: the text on which to train the topic model, which may be preprocessed in various ways
- `orig_text`: the original text of the document (optional; if absent, the `text` column will be used)
```
id	affiliation	dataset	title	author	date	text	orig_text
doc1	journal1	dataset1	Document 1's title	Author 1	2019-01-01	full content document 1	Full content of document 1.
doc2	journal2	dataset1	Document 2's title	Author 2	2019-05-01	full content document 2	Full content of document 2.
etc.
```
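Before training, it can be useful to sanity-check that a corpus file parses into the expected columns. A minimal check using only the Python standard library (not part of TOM's API):

```python
import csv
import io

# In practice, replace the string below with an open file, e.g. open('input/raw_corpus.tsv')
tsv = (
    "id\taffiliation\tdataset\ttitle\tauthor\tdate\ttext\torig_text\n"
    "doc1\tjournal1\tdataset1\tDocument 1's title\tAuthor 1\t2019-01-01\t"
    "full content document 1\tFull content of document 1.\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
print(sorted(rows[0].keys()))
# → ['affiliation', 'author', 'dataset', 'date', 'id', 'orig_text', 'text', 'title']
```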
The following code snippets are run by the provided scripts; they are shown here to demonstrate how to interact with the library directly. You'll need to import the required classes as follows:
```python
from tom_lib.structure.corpus import Corpus
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization, LatentDirichletAllocation
from tom_lib.visualization.visualization import Visualization
```
The following code snippet shows how to load a corpus and vectorize its documents using tf-idf with unigrams.
```python
corpus = Corpus(
    source_filepath='input/raw_corpus.tsv',
    vectorization='tfidf',
    n_gram=1,
    max_relative_frequency=0.8,
    min_absolute_frequency=4,
)
print(f'Corpus size: {corpus.size:,}')
print(f'Vocabulary size: {corpus.vocabulary_size:,}')
print('Vector representation of document 0:\n', corpus.word_vector_for_document(doc_id=0))
```
It is possible to instantiate an NMF or LDA object and then infer topics.

NMF (use `vectorization='tfidf'` when creating the corpus):
```python
topic_model = NonNegativeMatrixFactorization(corpus)
topic_model.infer_topics(num_topics=15)
```
LDA (using either the standard variational Bayesian inference or Gibbs sampling; use `vectorization='tf'` when creating the corpus):
```python
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='variational')

topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='gibbs')
```
Here we instantiate an NMF object, then generate plots with four metrics for estimating the optimal number of topics. Estimating the optimal number of topics may take a long time with a large corpus.
```python
topic_model = NonNegativeMatrixFactorization(corpus)
viz = Visualization(topic_model)

viz.plot_greene_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    tao=10,
    top_n_words=10,
)

viz.plot_arun_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_brunet_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_coherence_w2v_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    top_n_words=10,
)

# # LDA only
# viz.plot_perplexity_metric(
#     min_num_topics=5,
#     max_num_topics=50,
#     step=1,
# )
```
To allow reusing previously learned topic models, TOM can save them to disk, as shown below.
```python
import tom_lib.utils as ut

ut.save_topic_model(topic_model, 'output/NMF_15topics.pickle')
topic_model = ut.load_topic_model('output/NMF_15topics.pickle')
```
This code excerpt illustrates how one can manipulate a topic model, e.g. get the topic distribution for a document or the word distribution for a topic.
```python
print('\nTopics:')
topic_model.print_topics(num_words=10)
print('\nTopic distribution for document 0:',
      topic_model.topic_distribution_for_document(0))
print('\nMost likely topic for document 0:',
      topic_model.most_likely_topic_for_document(0))
print('\nFrequency of topics:',
      topic_model.topics_frequency())
print('\nTop 10 most relevant words for topic 2:',
      topic_model.top_words(2, 10))
```
- Create a compute instance using a service like AWS EC2 or GCP
  - Handy repo for easily provisioning an AWS EC2 instance
  - The instance size you choose depends on the size of your corpus and topic model parameters. Training a model will take significantly more RAM and CPU than simply hosting the app with a pre-trained model. It's possible to train the model on your own computer and `rsync` the files to the instance for running the app.
    - RAM suggestion for training a new model: 8 GB (or more)
    - RAM suggestion for running the app with a pre-trained model: 2 GB
- If using AWS, edit the Security Group inbound rules and create a custom TCP rule to allow inbound traffic on port 80.
  - This is the port that someone will use to access the web app from the outside world.
- SSH into the instance
- Install a web server: `sudo apt-get install nginx`
- Edit the file `/etc/nginx/sites-enabled/default` to contain only the following text.
  - If you are using a custom port for accessing the web app, change the `listen` port to that value
  - If you changed the port in `config.ini` (default is 5000), update the `proxy_pass` port to be that value. This is the port on which the Flask app will run.

```
server {
    listen 80;
    location / {
        include proxy_params;
        proxy_pass http://127.0.0.1:5000;
    }
}
```
- Restart the web server: `sudo service nginx restart`
NB: If you want to run more than one web app at a time, you can simply add additional copies of the above `server` block with different ports, and add corresponding Security Group inbound rules as described above.
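For instance, a hypothetical second app configured to run on Flask port 5001 could be exposed on port 8080 with an extra block like this (both port numbers are illustrative):

```
server {
    listen 8080;
    location / {
        include proxy_params;
        proxy_pass http://127.0.0.1:5001;
    }
}
```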
- Start a multiplexer session: `tmux new -s webserver`
- Run the web app: `python build_topic_model_browser.py --config_filepath=config.ini`
- Visit the instance's public IP address to verify it works. Watch for corresponding logs in the terminal.
- Detach from the multiplexer session: `ctrl-b d`
- To reattach to the session: `tmux attach -t webserver`
- To shut down the app: `ctrl-c`
- Remember that you'll be charged for the instance as long as it is running. You can use the AWS EC2 console to stop the instance so you stop being charged, and restart it later to pick up where you left off.