uclHU/Tweet-Classification-for-OSR-Topics

Tweet-Classification-for-OSR-Topics

Dataset

There are 11 datasets in total, including the original dataset from OSR, which has been renamed "osr_tweets_origin.csv". The datasets are described below:

osr_tweets_origin.csv: the original dataset from OSR with 20 topics.

osr_tweets_without_T_U_U.csv: the filtered dataset, with hashtags, usernames and URLs removed from each tweet. It has 20 topics.

osr_tweets_without_S_T_U_U.csv: same as osr_tweets_without_T_U_U.csv, but stop words are also removed and each tweet is lemmatized. This dataset also has 20 topics.

osr_tweets_origin_v2.csv: each tweet is identical to the original, but the topics are merged into 8.

osr_tweets_without_T_U_U_v2.csv: same as osr_tweets_without_T_U_U.csv, but the topics are merged into 8.

osr_tweets_without_S_T_U_U_v2.csv: same as osr_tweets_without_S_T_U_U.csv, but the topics are merged into 8.

gpt_tweets.csv & gpt_tweets_origin.csv: each tweet is generated by GPT, with 1000 tweets per topic. There are 9 topics in total, one more than in the merged datasets above. The additional topic is required by the project brief.

gpt_tweets_without_T_U_U.csv: the GPT-generated tweets with hashtags, usernames and URLs removed.

test_set.csv: the test set used to evaluate all models. It is a shuffled copy of osr_tweets_without_T_U_U_v2.csv, with each label represented by a number.
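The filtering behind the "without_T_U_U" datasets can be illustrated with a few regular expressions. This is a minimal sketch only: the patterns and the function name strip_tweet are assumptions for illustration, and the rules actually used in preprocessing.ipynb may differ (for example, it may keep the hashtag word and drop only the "#" symbol).

```python
import re

# Hypothetical patterns; not taken from the repository's notebooks.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_tweet(text: str) -> str:
    """Remove URLs, @usernames and #hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    # split()/join() collapses the gaps left by the removed tokens.
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_tweet("Check @ONS stats #economy https://example.com now"))
```

URLs are removed first so that a "#" or "@" inside a link is not matched separately by the other two patterns.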

The mapping from numeric label index to text label in test_set.csv is as follows:

0 : Children, Education and Skills

1 : Health and Social Care

2 : Crime and Security

3 : Economy

4 : Housing, Planning and Local Services

5 : Labour Market and Welfare

6 : Population and Society

7 : Transport, Environment and Climate Change
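For convenience, the mapping above can be written as a Python dictionary. The names ID2LABEL, LABEL2ID and decode are hypothetical and are not taken from the notebooks; only the index-to-label pairs come from the list above.

```python
# Index-to-label mapping for the 8 merged topics, copied from the list above.
ID2LABEL = {
    0: "Children, Education and Skills",
    1: "Health and Social Care",
    2: "Crime and Security",
    3: "Economy",
    4: "Housing, Planning and Local Services",
    5: "Labour Market and Welfare",
    6: "Population and Society",
    7: "Transport, Environment and Climate Change",
}

# Reverse lookup, useful when encoding text labels as numbers.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(indices):
    """Translate a sequence of numeric predictions into text labels."""
    return [ID2LABEL[i] for i in indices]


if __name__ == "__main__":
    print(decode([3, 0]))
```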

Python notebook files

Each of these Python notebooks can be run on its own, as they were originally written and executed on Google Colab. Fine-tuning the language models requires one extra step: logging in to a Hugging Face account so that the trained models can be stored. The files are described below:

Bert_grid_search_on_hyperparameter_with_preprocessing.ipynb: trains the model with a grid search over hyperparameters, using "osr_tweets_without_T_U_U_v2.csv".

Bert_grid_search_on_hyperparameter_without_preprocessing.ipynb: trains the model with a grid search over hyperparameters, using "osr_tweets_origin_v2.csv".

BertLarge_with_preprocessing.ipynb: trains the model using "osr_tweets_without_T_U_U_v2.csv".

BertLarge_without_preprocessing.ipynb: trains the model using "osr_tweets_origin_v2.csv".

DistilBert_grid_search_on_hyperparameter_with_preprocessing.ipynb: trains the model with a grid search over hyperparameters, using "osr_tweets_without_T_U_U_v2.csv".

DistilBert_grid_search_on_hyperparameter_without_preprocessing.ipynb: trains the model with a grid search over hyperparameters, using "osr_tweets_origin_v2.csv".

DistilBert_GPTdata_grid_search.ipynb: trains the model with a grid search over hyperparameters, using "gpt_tweets_without_T_U_U.csv".

OSR_LDA.ipynb: trains LDA models with the number of clusters ranging from 10 to 100 in steps of 5. The file also includes an accuracy evaluation.

OSR_BerTopic.ipynb: trains BERTopic models with the number of clusters ranging from 10 to 100 in steps of 5. The file also includes an accuracy evaluation.

gzip_knn.ipynb: trains a model that combines the gzip compressor with KNN. The file includes an exhaustive search over k, from 1 to 8.

zero_shot.ipynb: loads a pre-trained DeBERTa model and evaluates its prediction accuracy on "osr_tweets_without_T_U_U_v2.csv" and "osr_tweets_origin_v2.csv".

performance_evaluation.ipynb: evaluates the fine-tuned models above, as well as two ensemble models.

preprocessing.ipynb: contains all the preprocessing applied to the datasets.
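To give an idea of how a compressor can be combined with KNN, as in gzip_knn.ipynb, the sketch below classifies a text by majority vote among its k nearest training texts under the normalized compression distance (NCD). This is an illustration under assumptions: NCD is one common distance for this approach, but whether the notebook uses it is not stated here, and the names clen, ncd and knn_predict are hypothetical.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance; smaller means more similar."""
    ca, cb = clen(a), clen(b)
    cab = clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query, train_texts, train_labels, k=3):
    """Return the majority label among the k training texts nearest to query."""
    dists = sorted((ncd(query, t), y) for t, y in zip(train_texts, train_labels))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data for illustration only; labels follow the index mapping above.
    texts = [
        "the hospital waiting list for surgery grew again this winter",
        "train and bus fares will rise as rail emissions targets tighten",
    ]
    labels = [1, 7]
    print(knn_predict(texts[0], texts, labels, k=1))
```

A search over k like the one described for the notebook would call knn_predict for each k from 1 to 8 on held-out data and keep the value with the best accuracy.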
