This repo is a replication of SimCLS for abstractive text summarization. Unlike the original source code, we add code for generating summary candidates and simplify the training process. We also tried this framework with different architectures, including DeBERTa and BERT, as scorers for our COVID paper-titling task.
For lack of computational power, the generative model we use is a T5-small finetuned on our dataset.
Python 3 is required. To set up the environment:

```
conda create --name env --file spec-file.txt
pip3 install -r requirements.txt
```
- compare_mt -> https://github.com/neulab/compare-mt
- main.py -> training the scorer model
- model.py -> models
- data_utils.py -> dataloader
- utils.py -> utility functions
- preprocess.py -> data preprocessing
- generat_cand.py -> generate candidate summaries for training
- finetune_model.py -> finetune your own generative model
- evaluate_model.py -> evaluate the model with a trained scorer
The following directories should be created for our experiments:

- ./cache -> stores model checkpoints
Note that the dataset in this repo, clean_covid.csv, is just a sample containing 10,000 records. If you want access to the full data, please refer to the following link.
To generate candidates, please run:

```
python gen_candidate.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --dataset_percent {args.dataset_percent} --num_cands {args.num_cands}
```
- generator_name: path to the previously finetuned generator. In our case we use a T5-small model finetuned on the CORD dataset.
- dataset_name: path to the dataset (must be a csv file; the column for the source document should be named abstract, and the column for the reference summary should be named title).
- dataset_percent: percentage of the data used for generation; for debugging you can use a small percentage of the dataset. Defaults to 100.
- num_cands: number of candidates you want to generate.
Generated candidates are stored in the folder 'candidates/{args.generator_name}_{args.num_cands}'.
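For reference, here is a minimal sketch of the CSV layout the generation script expects (one abstract column for the source document, one title column for the reference summary). The file name and row contents are illustrative only:

```python
import csv

# Write a tiny dataset in the expected layout: one column named
# "abstract" (source document) and one named "title" (reference
# summary). The rows here are made-up examples.
rows = [
    {"abstract": "We study transmission dynamics of a novel coronavirus ...",
     "title": "Transmission dynamics of a novel coronavirus"},
    {"abstract": "We evaluate vaccine efficacy in a large cohort ...",
     "title": "Vaccine efficacy in a large cohort"},
]

with open("sample_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["abstract", "title"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the required columns are present.
with open("sample_dataset.csv", newline="") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == ["abstract", "title"]
    print(len(list(reader)))  # prints 2 (number of records)
```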
For data preprocessing, please run:

```
python preprocess.py --src_dir [path of the raw data] --tgt_dir [output path] --split [train/val/test] --cand_num [number of candidate summaries]
```
- src_dir: the candidate folder 'candidates/{args.generator_name}_{args.num_cands}'.

The preprocessing procedure stores the processed data as separate json files in tgt_dir.
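The exact schema of those json files is defined by preprocess.py; as a rough sketch, a plausible per-sample layout (the field names and directory structure below are assumptions for illustration, not the repo's actual schema) would be:

```python
import json
import os

# Hypothetical layout: one json file per sample under tgt_dir/<split>/,
# holding the source document, the reference summary, and the generated
# candidate summaries. Field names here are illustrative only.
sample = {
    "abstract": "We study transmission dynamics of a novel coronavirus ...",
    "title": "Transmission dynamics of a novel coronavirus",
    "candidates": [
        "Transmission dynamics of a coronavirus",
        "A study of coronavirus transmission",
    ],
}

tgt_dir = "processed"
os.makedirs(os.path.join(tgt_dir, "train"), exist_ok=True)
path = os.path.join(tgt_dir, "train", "0.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Each sample can then be loaded independently by the dataloader.
with open(path) as f:
    loaded = json.load(f)
print(len(loaded["candidates"]))  # prints 2
```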
You may specify the hyper-parameters in main.py.
```
python main.py --cuda --gpuid [list of gpuid] -l
python main.py --cuda --gpuid [list of gpuid] -l --model_pt [model path]
```

[model path] should be a subdirectory of the ./cache directory, e.g. cnndm/model.pt (it shouldn't contain the prefix ./cache/).
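The scorer is trained to rank candidates in the same order as their ROUGE against the reference, following SimCLS. As a rough illustration, the pairwise margin ranking loss from the SimCLS paper can be sketched in plain Python as follows (the margin value is an assumption; the real training in main.py operates on model scores, not floats):

```python
def ranking_loss(scores, margin=0.01):
    """Pairwise margin ranking loss over candidate scores.

    `scores` must be ordered by descending reference-metric quality
    (e.g. ROUGE against the gold title): a better candidate should be
    scored above every candidate ranked below it, by a margin that
    grows with the rank gap, as in SimCLS.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


# Correctly ordered scores incur no loss ...
print(ranking_loss([0.9, 0.5, 0.1]))  # prints 0.0
# ... while an inverted pair is penalized.
print(round(ranking_loss([0.1, 0.5]), 2))  # prints 0.41
```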
To evaluate with a trained scorer, run:

```
python evaluate_model.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --scorer_path cache/22-12-17-0/scorer.bin --dataset_percent 10
```
| | Before SimCLS | RoBERTa-8 |
| --- | --- | --- |
| Rouge1 | 0.2288 | 0.2269 |
| Rouge2 | 0.0764 | 0.0735 |
| RougeL | 0.1864 | 0.1823 |
| | Before SimCLS | RoBERTa-8 | RoBERTa-16 | BERT-16 | ELECTRA-16 | xlm-roberta-16 | distilbert-16 | albert-16 | deberta-16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rouge1 | 0.4266 | 0.4306 | 0.4145 | 0.4258 | 0.4173 | 0.4145 | 0.4269 | 0.4236 | 0.4171 |
| Rouge2 | 0.2222 | 0.2077 | 0.2009 | 0.2099 | 0.2059 | 0.2009 | 0.2061 | 0.2067 | 0.1976 |
| RougeL | 0.3659 | 0.3598 | 0.3454 | 0.3531 | 0.3486 | 0.3454 | 0.3543 | 0.3542 | 0.3465 |
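At inference time, SimCLS-style evaluation simply returns the candidate the scorer ranks highest and measures its ROUGE against the reference. The reranking step behind the scorer columns above amounts to (a minimal sketch; candidate titles and scores below are hypothetical):

```python
def rerank(candidates, scores):
    """Return the candidate with the highest scorer score."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy example: three candidate titles with hypothetical scorer scores.
cands = ["title A", "title B", "title C"]
scores = [0.12, 0.87, 0.43]
print(rerank(cands, scores))  # prints "title B"
```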