SeedTopicMine

The source code for the paper "Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts", published in WSDM 2023.

Data

We use two benchmark datasets, NYT and Yelp, in our paper, adapted from here. We use 60% of the data as the training corpus and the remaining 40% for evaluation.
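As an illustration of a 60/40 split, here is a minimal sketch. It assumes the corpus is a list of documents and that the split is a seeded random shuffle; the function name `split_corpus` and its parameters are hypothetical and are not part of this repository, and the actual split used in the paper may have been produced differently.

```python
import random


def split_corpus(docs, train_frac=0.6, seed=0):
    """Shuffle documents with a fixed seed and split them into train/eval parts.

    Assumption: a seeded random split. The repository does not state how
    its 60/40 split was produced.
    """
    docs = list(docs)
    rng = random.Random(seed)  # local RNG so global random state is untouched
    rng.shuffle(docs)
    cut = int(len(docs) * train_frac)
    return docs[:cut], docs[cut:]


if __name__ == "__main__":
    train, evaluation = split_corpus([f"doc {i}" for i in range(10)])
    print(len(train), len(evaluation))
```

Fixing the seed keeps the split reproducible across runs.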

Use the following command to generate PLM embeddings for the training corpus (a GPU is required):

python plm_emb.py

Run SeedTopicMine

Before the first run, compile CatE:

cd cate
make cate
cd ..

Then run SeedTopicMine with the following command:

python main.py --dataset nyt --topic locations

Baselines

We compare four baselines in our paper: SeededLDA, Anchored CorEx, KeyETM, and CatE.

To reproduce the results of SeededLDA and Anchored CorEx, please refer to ./baselines/SeededLDA.py and ./baselines/AnchoredCorEx.py, respectively.

To reproduce the results of KeyETM and CatE, please refer to their GitHub repositories (i.e., KeyETM and CatE).

Annotations

To compute P@k and NDCG@k scores of SeedTopicMine and the baselines, we invite five annotators to independently judge if each discovered term is discriminatively relevant to a seed. We release the annotation results in ./annotations/. For example, ./annotations/yelp_sentiment_annotation.txt is as follows:

Term	Annotator1	Annotator2	Annotator3	Annotator4	Annotator5
also	none	none	none	none	none
amazing	good	none	good	good	good
anger	bad	bad	bad	bad	bad
apathetic	bad	bad	bad	bad	bad
appalling	bad	bad	bad	bad	bad

The file has 6 tab-separated columns. The first column is the term. The other 5 columns give the category that each of the 5 annotators judged the term relevant to. If an annotator judged a term relevant to more than one category, or not relevant to any category, the category is marked as "none".
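The format above can be read and scored with a few lines of Python. The following is a minimal sketch, not the evaluation script used in the paper. It makes two assumptions that this README does not confirm: the five judgments are aggregated by strict majority vote, and NDCG@k is normalized by an ideal ranking in which all k terms are relevant. All function names are hypothetical.

```python
import math
from collections import Counter

# First rows of ./annotations/yelp_sentiment_annotation.txt, as shown above.
SAMPLE = (
    "Term\tAnnotator1\tAnnotator2\tAnnotator3\tAnnotator4\tAnnotator5\n"
    "also\tnone\tnone\tnone\tnone\tnone\n"
    "amazing\tgood\tnone\tgood\tgood\tgood\n"
    "anger\tbad\tbad\tbad\tbad\tbad\n"
)


def parse_annotations(text):
    """Map each term to its list of per-annotator labels (header skipped)."""
    rows = [line.split("\t") for line in text.strip().splitlines()]
    return {row[0]: row[1:] for row in rows[1:]}


def majority_label(labels):
    """Label chosen by a strict majority of annotators, else 'none' (assumption)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else "none"


def precision_at_k(ranked_terms, seed, gold, k):
    """Fraction of the top-k terms whose gold label equals the seed category."""
    hits = sum(gold.get(term, "none") == seed for term in ranked_terms[:k])
    return hits / k


def ndcg_at_k(ranked_terms, seed, gold, k):
    """Binary-gain NDCG@k; the ideal ranking is assumed to have k relevant terms."""
    gains = [1.0 if gold.get(t, "none") == seed else 0.0 for t in ranked_terms[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(k))
    return dcg / idcg


if __name__ == "__main__":
    gold = {t: majority_label(ls) for t, ls in parse_annotations(SAMPLE).items()}
    ranking = ["amazing", "also"]  # hypothetical model output for seed "good"
    print(precision_at_k(ranking, "good", gold, 2))
    print(ndcg_at_k(ranking, "good", gold, 2))
```

To score a real run, replace `SAMPLE` with the contents of an annotation file and `ranking` with the terms discovered for a given seed.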

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2023effective,
  title={Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts},
  author={Zhang, Yu and Zhang, Yunyi and Michalski, Martin and Jiang, Yucheng and Meng, Yu and Han, Jiawei},
  booktitle={WSDM'23},
  pages={429--437},
  year={2023}
}
