TOM

This is a heavily overhauled version of the original TOM library by Adrien Guille (AdrienGuille/TOM on GitHub). New features include:

  • configuration files
  • ability to set topic model hyperparameters
  • major computational speedups
  • additional assessment metrics for choosing an appropriate number of topics
  • normalized topic loadings
  • new charts
  • interactive charts
  • access to raw data
  • a topic loading similarity browser app

TOM (TOpic Modeling) is a Python 3 library for topic modeling and browsing, licensed under the MIT license. Its objective is to allow for an efficient analysis of a text corpus from start to finish, via the discovery of latent topics. To this end, TOM features functions for preparing and vectorizing a text corpus, though you may want to perform additional preprocessing steps on the corpus before topic modeling. It also offers a common interface for two topic models (LDA using either variational inference or Gibbs sampling, and NMF using alternating least squares with a projected gradient method), and implements five state-of-the-art methods for estimating the optimal number of topics to model a corpus. TOM constructs an interactive web browser-based application that makes it easy to explore a topic model and the related corpus.

Features

Vector space modeling

  • Feature selection based on word frequency
    • Via the parameters max_relative_frequency, min_absolute_frequency, and max_features
  • Weighting
    • tf-idf (for NMF)
    • tf (for LDA)

Topic modeling

  • Latent Dirichlet Allocation (LDA), using either variational inference or Gibbs sampling
  • Non-negative Matrix Factorization (NMF), using alternating least squares with a projected gradient method

Estimating the optimal number of topics

  • Stability analysis (Greene et al.; plot_greene_metric)
  • Spectral analysis (Arun et al.; plot_arun_metric)
  • Consensus-based analysis (Brunet et al.; plot_brunet_metric)
  • Word2vec-based topic coherence (plot_coherence_w2v_metric)
  • Perplexity, for LDA only (plot_perplexity_metric)

Installation

  1. Clone this repo: git clone git@github.com:warmlogic/TOM.git

  2. In a terminal, cd to the TOM directory

  3. Run the following command to install Miniconda (Python 3) and the required libraries (installed in the base conda environment):

    ./python_env_setup.sh
  4. Log out and back in to ensure the base conda environment is active
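To verify the setup (a quick hypothetical spot check, not part of the setup script), confirm that the base conda environment is active and that tom_lib imports cleanly when run from the TOM directory:

    conda env list
    python -c "import tom_lib; print('tom_lib imported OK')"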

See python_env_setup.sh for the full list of installed libraries.

Provided scripts

The provided scripts use the parameters defined in the configuration file. Copy the config_template.ini file to config.ini and set it up as desired.
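As a rough illustration, a config.ini might look like the sketch below. The section and key names here are hypothetical; config_template.ini is the authoritative reference for the actual options.

; Hypothetical sketch only; copy config_template.ini for the real keys
[main]
source_filepath = input/raw_corpus.tsv
vectorization = tfidf
num_topics = 15
port = 5000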

In order of importance, the scripts are:

  1. assess_topics.py: Produce artifacts used for estimating the optimal number of topics
  2. build_topic_model_browser.py: Run a local web server and generate a web browser-based application for exploring the topic model and corpus
  3. infer_topics.py: Simply train and save topic models for a range of numbers of topics

Run a script in the terminal using the following command structure:

python <script name> --config_filepath=<config file name>
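For example, to produce the topic-count assessment artifacts using the configuration file created above:

python assess_topics.py --config_filepath=config.ini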

Expected corpus format

A corpus is a TSV (tab-separated values) file describing documents, formatted as one document per line with the following columns:

  • id: a unique identifier
  • affiliation: for grouping documents within a dataset
  • dataset: used when combining documents from various sources
  • title
  • author
  • date: preferably formatted as YYYY-MM-DD
  • text: the text on which to train the topic model, which may be preprocessed in various ways
  • orig_text: the original text of the document (optional; if absent, will use text column)
For example:

id    affiliation  dataset   title               author    date        text                     orig_text
doc1  journal1     dataset1  Document 1's title  Author 1  2019-01-01  full content document 1  Full content of document 1.
doc2  journal2     dataset1  Document 2's title  Author 2  2019-05-01  full content document 2  Full content of document 2.
etc.
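As a minimal sketch of producing a compatible file (pandas is an assumption here; any TSV writer works):

import pandas as pd

# One toy document with the columns TOM expects (orig_text is optional)
docs = pd.DataFrame([{
    'id': 'doc1',
    'affiliation': 'journal1',
    'dataset': 'dataset1',
    'title': "Document 1's title",
    'author': 'Author 1',
    'date': '2019-01-01',
    'text': 'full content document 1',
    'orig_text': 'Full content of document 1.',
}])

# Write one document per line, tab-separated, without the index column
docs.to_csv('input/raw_corpus.tsv', sep='\t', index=False)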

Interactive usage

The following code snippets are what the provided scripts run under the hood; they are shown here to demonstrate how to interact with the library directly. You'll need to import the required classes as follows:

from tom_lib.structure.corpus import Corpus
from tom_lib.nlp.topic_model import NonNegativeMatrixFactorization, LatentDirichletAllocation
from tom_lib.visualization.visualization import Visualization

Load and prepare a corpus

The following code snippet shows how to load a corpus and vectorize its documents using tf-idf with unigrams.

corpus = Corpus(
    source_filepath='input/raw_corpus.tsv',
    vectorization='tfidf',
    n_gram=1,
    max_relative_frequency=0.8,
    min_absolute_frequency=4,
)
print(f'Corpus size: {corpus.size:,}')
print(f'Vocabulary size: {corpus.vocabulary_size:,}')
print('Vector representation of document 0:\n', corpus.word_vector_for_document(doc_id=0))

Instantiate a topic model and infer topics

It is possible to instantiate an NMF or LDA object and then infer topics.

NMF (use vectorization='tfidf' when creating the corpus):

topic_model = NonNegativeMatrixFactorization(corpus)
topic_model.infer_topics(num_topics=15)

LDA (using either the standard variational Bayesian inference or Gibbs sampling; use vectorization='tf' when creating the corpus):

# Variational Bayesian inference
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='variational')

# Gibbs sampling
topic_model = LatentDirichletAllocation(corpus)
topic_model.infer_topics(num_topics=15, algorithm='gibbs')

Instantiate a topic model and estimate the optimal number of topics

Here we instantiate an NMF object, then generate plots of four metrics for estimating the optimal number of topics. Note that this estimation may take a long time on a large corpus.

topic_model = NonNegativeMatrixFactorization(corpus)

viz = Visualization(topic_model)

viz.plot_greene_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    tao=10,
    top_n_words=10,
)

viz.plot_arun_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_brunet_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    iterations=10,
)

viz.plot_coherence_w2v_metric(
    min_num_topics=5,
    max_num_topics=50,
    step=1,
    top_n_words=10,
)

# # LDA only
# viz.plot_perplexity_metric(
#     min_num_topics=5,
#     max_num_topics=50,
#     step=1,
# )

Save/load a topic model

To allow reuse of previously learned topic models, TOM can save them to disk, as shown below.

import tom_lib.utils as ut
ut.save_topic_model(topic_model, 'output/NMF_15topics.pickle')
topic_model = ut.load_topic_model('output/NMF_15topics.pickle')

Print information about a topic model

This code excerpt illustrates how one can manipulate a topic model, e.g. get the topic distribution for a document or the word distribution for a topic.

print('\nTopics:')
topic_model.print_topics(num_words=10)
print('\nTopic distribution for document 0:',
      topic_model.topic_distribution_for_document(0))
print('\nMost likely topic for document 0:',
      topic_model.most_likely_topic_for_document(0))
print('\nFrequency of topics:',
      topic_model.topics_frequency())
print('\nTop 10 most relevant words for topic 2:',
      topic_model.top_words(2, 10))

Run the browser app on a remote web server

Setup

  1. Create a compute instance using a service like AWS EC2 or GCP

    1. Handy repo for easily provisioning an AWS EC2 instance
    2. The instance size you choose depends on the size of your corpus and topic model parameters. Training a model will take significantly more RAM and CPU than simply hosting the app with a pre-trained model. It's possible to train the model on your own computer and rsync the files to the instance for running the app.
      1. RAM suggestion for training a new model: 8 GB (or more)
      2. RAM suggestion for running the app with a pre-trained model: 2 GB
  2. If using AWS, edit the Security Group inbound rules and create a custom TCP rule to allow inbound traffic on port 80.

    1. This is the port that someone will use to access the web app from the outside world.
  3. SSH into the instance

  4. Install a web server:

    sudo apt-get install nginx
  5. Edit the file /etc/nginx/sites-enabled/default to contain only the following text.

    1. If you are using a custom port for accessing the web app, change the listen port to that value
    2. If you changed the port in config.ini (default is 5000), update the proxy_pass port to that value. This is the port on which the Flask app runs.
    server {
            listen 80;
            location / {
                     include proxy_params;
                     proxy_pass http://127.0.0.1:5000;
            }
    }
    
  6. Restart the web server:

    sudo service nginx restart

NB: To run more than one web app at a time, simply add additional server blocks with different ports, as in the sketch below, and add corresponding Security Group inbound rules as described above.
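For instance, a second server block (a sketch only; the port numbers are arbitrary) that listens on port 8080 and proxies to a second Flask app running on port 5001 might look like:

    server {
            listen 8080;
            location / {
                     include proxy_params;
                     proxy_pass http://127.0.0.1:5001;
            }
    }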

Run it

  1. Start a multiplexer session (tmux new -s webserver)
  2. Run the web app: python build_topic_model_browser.py --config_filepath=config.ini
  3. Visit the instance's public IP address to verify it works. Watch for corresponding logs in the terminal.
  4. Detach from the multiplexer session: ctrl-b d
  5. To reattach to the session: tmux attach -t webserver
  6. To shut down the app: ctrl-c
  7. Remember that you'll be charged for the instance as long as it is running. Stop the instance via the AWS EC2 console to stop incurring charges, and restart it later to pick up where you left off.
