These are some modifications to the code for the Embedded Topic Model (ETM). My main contributions are a Python script that converts a one-line-per-document corpus into a bag-of-words (BoW) representation, and a mode for applying an existing topic model to such a dataset.
First, encode your own corpus (in the one-document-per-line format) into a BoW matrix:
python3 data_new.py -c CORPUS.ol -o DATADIR
An example of the source file is available in the scripts directory (covid-tweets-sample.ol.xz).
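For illustration, here is a minimal sketch of what this step produces, using scikit-learn's CountVectorizer rather than the actual data_new.py code: each line of the .ol file becomes one row of a sparse document-term count matrix, with an accompanying vocabulary.

```python
# Conceptual sketch only (not the data_new.py implementation): one line of
# the .ol file = one document; the result is a docs x vocab count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [line.strip() for line in open("CORPUS.ol", encoding="utf-8")]
vectorizer = CountVectorizer()            # default tokenisation rules
bow = vectorizer.fit_transform(docs)      # scipy sparse matrix, docs x vocab
vocab = vectorizer.get_feature_names_out()
print(bow.shape, vocab[:10])
```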
The default list of stop words for the first step of BoW processing is for English. The scripts directory also has sample stop word lists for other languages, e.g.,
python3 data_new.py -c CORPUS-fr.ol -o DATADIR -s stop-fr.txt
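As a rough sketch of what the stop-word step amounts to, assuming the stop list is a plain text file with one word per line (an assumption about the list format):

```python
# Stop-word filtering sketch; assumes one stop word per line in the list.
stopwords = {w.strip() for w in open("stop-fr.txt", encoding="utf-8")}
with open("CORPUS-fr.ol", encoding="utf-8") as src:
    for line in src:
        kept = [tok for tok in line.split() if tok.lower() not in stopwords]
        print(" ".join(kept))
```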
If your one-line file has NOT been tokenised, it is better to tokenise it (and possibly lower-case it) before BoW processing, for example:
./tokenise1.sh <CORPUS-fr.ol | awk '{print(tolower($0))}' >CORPUS-fr.ollc
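If you prefer to stay in Python, a simple regex-based equivalent of the pipeline above might look like the sketch below (lowertok.py is a hypothetical script name; tokenise1.sh may apply language-specific rules that this sketch does not reproduce):

```python
# Regex-based tokenisation plus lower-casing, reading stdin line by line.
# usage: python3 lowertok.py < CORPUS-fr.ol > CORPUS-fr.ollc
import re
import sys

for line in sys.stdin:
    tokens = re.findall(r"\w+", line.lower())  # Unicode-aware word matching
    print(" ".join(tokens))
```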
Other languages may need more advanced pre-processing, for example, proper word segmentation for Chinese or lemmatisation for Russian or Turkish. Either way, the procedure for estimating a topic model stays the same as long as the file used for creating the BoW dataset is in the one-document-per-line format.
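For Chinese, for instance, any segmenter producing space-separated tokens in the one-line format should do; a minimal sketch using the third-party jieba package:

```python
# Chinese word segmentation before BoW processing (requires: pip install jieba).
import sys
import jieba

for line in sys.stdin:
    # jieba.cut yields word tokens; join them with spaces to stay one-line.
    print(" ".join(jieba.cut(line.strip())))
```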
You can create a new topic model from this dataset and evaluate it by running:
python3 main.py --mode train --dataset name --data_path DATADIR --num_topics 50 --epochs 50
python3 main.py --mode eval --data_path DATADIR --td --tc --tp --load_from results/etm_name_K_50_Htheta_530_RhoSize_300
The product of topic diversity (the --td argument) and topic coherence (the --tc argument) is a useful measure of how good the hyper-parameters are. The most important choice is the right number of topics for your dataset. For the other parameters, please run
python3 main.py -h
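One way to choose the number of topics is to sweep over several values of K and keep the run with the highest TD * TC product. A sketch, with placeholder score values standing in for the numbers printed by the eval step:

```python
# Train one model per candidate K (dataset name and paths as in the
# commands above), then compare runs by TD * TC.
import subprocess

for k in (25, 50, 100):
    subprocess.run(["python3", "main.py", "--mode", "train",
                    "--dataset", "name", "--data_path", "DATADIR",
                    "--num_topics", str(k), "--epochs", "50"], check=True)

# Placeholder (TD, TC) values: in practice, copy them from the eval printout.
scores = {25: (0.74, 0.18), 50: (0.70, 0.16), 100: (0.79, 0.11)}
best_k = max(scores, key=lambda k: scores[k][0] * scores[k][1])
print("best K by TD*TC:", best_k)
```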
A model can be applied to a new corpus by first making a BoW dataset for it with the same dictionary as the original model (the -d argument):
python3 data_new.py -c CORPUS-NEW.ol -d DATADIR/vocab.pkl -o BOW-NEW
python3 main.py --mode apply --dataset dataname -b BOW-NEW --output CORPUSNEW.topics --load_from results/etm_dataname_K_50....
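Before applying a model to a new corpus, it can be worth checking how well the original dictionary covers the new data. A sketch, assuming vocab.pkl is a pickled list of vocabulary words (an assumption about the format data_new.py uses):

```python
# Estimate what fraction of the new corpus tokens the dictionary covers.
import pickle
from collections import Counter

# Format assumption: vocab.pkl holds a pickled list of word strings.
with open("DATADIR/vocab.pkl", "rb") as f:
    vocab = set(pickle.load(f))

counts = Counter(tok for line in open("CORPUS-NEW.ol", encoding="utf-8")
                 for tok in line.split())
covered = sum(c for w, c in counts.items() if w in vocab)
print(f"dictionary coverage: {covered / sum(counts.values()):.1%}")
```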
The rest is practically the same as in the original repository (https://github.com/adjidieng/ETM), apart from more systematic command-line parameters.
This has been tested with Python 3.7 and PyTorch 1.7.1, but other versions are likely to work as well.
For a large general-purpose corpus, I achieved fairly interpretable results by estimating 25 topics on ukWac, with a resulting Topic Diversity of 0.78 and Topic Coherence of 0.195. If you have a tokenised corpus in the one-line format, you can apply this model (from the ./results directory) by first encoding your corpus into a BoW dataset with the same ukWac dictionary and then applying the model:
python3 scripts/data_new.py -c CORPUS-NEW.ol -d results/vocab.pkl -o BOW-NEW
python3 main.py --mode apply -b BOW-NEW -d results/vocab.pkl -l results/etm_ukwac_K_25_Htheta_350_RhoSize_220
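To inspect the per-document output of the apply step (e.g. the CORPUSNEW.topics file produced by --output above), something like the following works if the file contains one line per document with whitespace-separated topic proportions (an assumption; adjust the parsing to whatever format your version actually writes):

```python
# Format assumption: one line per document, K whitespace-separated topic
# proportions per line. Prints the dominant topic for each document.
with open("CORPUSNEW.topics", encoding="utf-8") as f:
    for doc_id, line in enumerate(f):
        theta = [float(x) for x in line.split()]
        top = max(range(len(theta)), key=theta.__getitem__)
        print(doc_id, "dominant topic:", top, f"({theta[top]:.2f})")
```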
I have applied this model to test the degree to which pre-trained language models (like BERT) can be fooled by topic distributions; see:
@inproceedings{roussinov-sharoff-2023-bert,
title = "{BERT} Goes Off-Topic: Investigating the Domain Transfer Challenge using Genre Classification",
author = "Roussinov, Dmitri and Sharoff, Serge",
editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.34/",
doi = "10.18653/v1/2023.findings-emnlp.34",
pages = "468--483"
}
ETM is particularly useful for estimating and interpreting topic models from short texts such as those from social media. I created this update for our project on the analysis of COVID communication, where it was used to estimate the topics of COVID-related texts in our collection:
Topic | Keywords |
---|---|
0 | patients, disease, infection, respiratory, study, clinical, patient, severe, treatment, symptoms, risk, acute, studies, blood |
1 | https, corona, covid, pandemic, virus, lockdown, time, good, today, covid19, read, great, make, day, fight, world, hope, free |
3 | play, game, season, year, show, time, love, playing, music, team, dropped, night, video, players, fans, tv, football, person |
4 | uk, government, lockdown, coronavirus, people, travel, nhs, restrictions, public, johnson, week, measures, rules, health, boris |
5 | health, care, pandemic, public, medical, risk, research, social, patients, services, disease, mental, measures, information |
6 | mask, social, distancing, face, wear, hands, hand, wearing, air, water, food, spread, wash, distance, sanitizer, buy, protect |
7 | home, work, stay, school, safe, family, children, schools, kids, parents, day, time, online, child, students, friends, families |
9 | people, media, stop, government, don, covid, fake, wrong, science, political, pandemic, truth, blame, fear, real, fact, stupid |
10 | trump, president, america, white, pandemic, house, vote, police, bill, response, american, biden, election, states, donald |
11 | cases, deaths, 2020, 000, total, number, death, confirmed, india, rate, coronavirus, 10, reported, million, 24, recovered, days |
12 | virus, vaccine, human, influenza, transmission, diseases, infectious, species, animals, samples, study, infected, strains |
13 | corona, india, sir, govt, due, students, situation, pm, exams, lockdown, delhi, indian, exam, fight, modi, minister, request |
15 | cells, al, viral, virus, protein, cell, viruses, infection, rna, proteins, human, expression, 10, gene, fig, dna, activity |
16 | people, covid, virus, die, death, flu, vaccine, lives, care, dying, don, sick, numbers, homes, infected, risk, immunity, dead |
17 | data, model, time, number, based, disease, analysis, information, study, models, results, population, system, set, rate, approach |
18 | things, ve, time, don, thing, good, happen, people, lot, feel, happened, bad, back, life, make, ago, years, long, ll |
19 | business, money, pay, economy, market, crisis, pandemic, economic, impact, jobs, businesses, industry, financial, food |
20 | china, world, virus, country, chinese, pandemic, global, war, people, wuhan, south, spread, africa, human, rights, europe |
21 | corona, god, shit, virus, fuck, gonna, fucking, lol, man, love, covid, bc, ass, im, damn, ur, dont, wanna, ppl |
22 | positive, test, state, hospital, quarantine, https, covid, coronavirus, health, symptoms, days, city, contact, case, app |
24 | https, coronavirus, news, live, latest, amid, outbreak, updates, uk, daily, report, top, bbc, times, wave, breaking, drug, sign |
For example, Topics 0, 15 and 17 mostly come from research updates, Topics 11 and 24 from forwarded news items, while Topics 16, 18 and 21 mostly come from informal exchanges. The model also detects topics discussed in specific communities (Topics 4, 10 and 13).
This has been reported in:
@article{boumechaal2024attitudes,
title = "Attitudes, communicative functions, and lexicogrammatical features of anti-vaccine discourse on Telegram",
author = "Boumechaal, Souad and Sharoff, Serge",
journal = "Applied Corpus Linguistics",
volume = "4",
number = "2",
year = "2024",
publisher = "Elsevier",
url = "https://ssharoff.github.io/publications/2023-applied-covid.pdf",
doi = "10.1016/j.acorp.2024.100095"
}