The source code for the paper "Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts", published in WSDM 2023.

yzhan238/SeedTopicMine

SeedTopicMine

The source code for the paper "Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts", published in WSDM 2023.

Data

We use two benchmark datasets, NYT and Yelp, in our paper, adapted from here. For each dataset, we use 60% of the documents as the training corpus and the remaining 40% for evaluation.
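For readers preparing their own corpus, the 60/40 split described above can be sketched as follows. This is only an illustration: the function name `split_corpus`, the random shuffling, and the fixed seed are assumptions, and the README does not state how the released split was actually produced.

```python
import random


def split_corpus(docs, train_frac=0.6, seed=0):
    """Shuffle documents and split them into training and evaluation parts.

    Hypothetical helper: the shuffling strategy and seed are assumptions,
    not the procedure used to build the released datasets.
    """
    rng = random.Random(seed)      # local RNG so the split is reproducible
    shuffled = list(docs)          # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    docs = [f"document {i}" for i in range(10)]
    train, evaluation = split_corpus(docs)
    print(len(train), len(evaluation))
```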

Use the following command to generate PLM embeddings for the training corpus (a GPU is required):

python plm_emb.py

Run SeedTopicMine

Before the first run, compile CatE:

cd cate
make cate
cd ..

Then run SeedTopicMine with the following command:

python main.py --dataset nyt --topic locations

Baselines

We compare four baselines in our paper: SeededLDA, Anchored CorEx, KeyETM, and CatE.

To reproduce the results of SeededLDA and Anchored CorEx, please refer to ./baselines/SeededLDA.py and ./baselines/AnchoredCorEx.py, respectively.

To reproduce the results of KeyETM and CatE, please refer to their GitHub repositories (i.e., KeyETM and CatE).

Annotations

To compute the P@k and NDCG@k scores of SeedTopicMine and the baselines, we invited five annotators to independently judge whether each discovered term is discriminatively relevant to a seed. We release the annotation results in ./annotations/. For example, ./annotations/yelp_sentiment_annotation.txt begins as follows:

Term	Annotator1	Annotator2	Annotator3	Annotator4	Annotator5
also	none	none	none	none	none
amazing	good	none	good	good	good
anger	bad	bad	bad	bad	bad
apathetic	bad	bad	bad	bad	bad
appalling	bad	bad	bad	bad	bad

There are 6 columns. The first column is the term. The other 5 columns give the category each of the 5 annotators judged the term relevant to. If an annotator judges a term relevant to more than one category, or not relevant to any category, the category is marked as "none".
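The annotation format above can be read and scored with a short script. The following is a minimal sketch, not the evaluation code used in the paper: it assumes the file is tab-separated with one header row, aggregates the five judgments by strict majority vote (an assumption), and uses binary relevance with an ideal ranking of k relevant terms for NDCG@k (also an assumption). All function names are hypothetical.

```python
import math
from collections import Counter

# First rows of yelp_sentiment_annotation.txt as shown above (assumed tab-separated).
SAMPLE = (
    "Term\tAnnotator1\tAnnotator2\tAnnotator3\tAnnotator4\tAnnotator5\n"
    "also\tnone\tnone\tnone\tnone\tnone\n"
    "amazing\tgood\tnone\tgood\tgood\tgood\n"
    "anger\tbad\tbad\tbad\tbad\tbad\n"
    "apathetic\tbad\tbad\tbad\tbad\tbad\n"
    "appalling\tbad\tbad\tbad\tbad\tbad\n"
)


def parse_annotations(text):
    """Map each term to the list of labels given by the annotators."""
    rows = {}
    for line in text.strip().splitlines()[1:]:  # skip the header row
        term, *labels = line.split("\t")
        rows[term] = labels
    return rows


def majority_label(labels):
    """Return the label chosen by more than half of the annotators, else 'none'."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else "none"


def precision_at_k(ranked_terms, category, gold, k):
    """Fraction of the top-k terms whose aggregated label equals the category."""
    hits = sum(1 for term in ranked_terms[:k] if gold.get(term) == category)
    return hits / k


def ndcg_at_k(ranked_terms, category, gold, k):
    """NDCG@k with binary relevance; the ideal ranking has k relevant terms."""
    rels = [1 if gold.get(term) == category else 0 for term in ranked_terms[:k]]
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = sum(1 / math.log2(rank + 2) for rank in range(k))
    return dcg / ideal


if __name__ == "__main__":
    gold = {t: majority_label(l) for t, l in parse_annotations(SAMPLE).items()}
    ranking = ["anger", "amazing", "appalling"]  # hypothetical output for seed "bad"
    print(precision_at_k(ranking, "bad", gold, k=3))
    print(ndcg_at_k(ranking, "bad", gold, k=3))
```

To score a real run, replace `SAMPLE` with the contents of a file in ./annotations/ and `ranking` with the terms a method returns for one seed.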

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{zhang2023effective,
  title={Effective Seed-Guided Topic Discovery by Integrating Multiple Types of Contexts},
  author={Zhang, Yu and Zhang, Yunyi and Michalski, Martin and Jiang, Yucheng and Meng, Yu and Han, Jiawei},
  booktitle={WSDM'23},
  pages={429--437},
  year={2023}
}
