This repo is a replication of SimCLS for abstractive text summarization. Unlike the original source code, we add code for generating summary candidates and simplify the training process. We also tried this framework with different architectures, including DeBERTa and BERT, as scorers for our COVID paper-titling task.
For lack of computational power, the generative model we use is a T5-small finetuned on our dataset.
Python 3 is required. To set up the environment:

```
conda create --name env --file spec-file.txt
pip3 install -r requirements.txt
```
- compare_mt -> https://github.com/neulab/compare-mt
- main.py -> training the scorer model
- model.py -> models
- data_utils.py -> dataloader
- utils.py -> utility functions
- preprocess.py -> data preprocessing
- generat_cand.py -> generate candidate summaries for training
- finetune_model.py -> finetune your own generative model
- evaluate_model.py -> evaluate the model with a trained scorer
The following directories should be created for our experiments:

- ./cache -> stores model checkpoints
Note that the dataset in this repo, clean_covid.csv, is just a sample containing 10,000 records. If you want access to the full data, please refer to the following link.
To generate candidates, please run:

```
python gen_candidate.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --dataset_percent {args.dataset_percent} --num_cands {args.num_cands}
```
- generator_name: path to the previously finetuned generator. In our case we use a T5-small model finetuned on the CORD dataset.
- dataset_name: path to the dataset (must be a csv file; the column for the source document should be named abstract, and the column for the reference summary should be named title).
- dataset_percent: percentage of the data used for generation; for debugging you can use a small percentage of the dataset. Defaults to 100.
- num_cands: number of candidates you want to generate.
Generated candidates are stored in the folder 'candidates/{args.generator_name}_{args.num_cands}'.
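For reference, here is a minimal sketch of the CSV layout the generation script expects (one abstract column for the source document, one title column for the reference summary). The file name and row contents are illustrative only:

```python
import csv

# Write a tiny dataset in the expected layout: one column named
# "abstract" (source document) and one named "title" (reference
# summary). The rows here are made-up examples.
rows = [
    {"abstract": "We study transmission dynamics of a novel coronavirus ...",
     "title": "Transmission dynamics of a novel coronavirus"},
    {"abstract": "We evaluate vaccine efficacy in a large cohort ...",
     "title": "Vaccine efficacy in a large cohort"},
]

with open("sample_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["abstract", "title"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the required columns are present.
with open("sample_dataset.csv", newline="") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == ["abstract", "title"]
    print(len(list(reader)))  # prints 2 (number of records)
```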
For data preprocessing, please run:

```
python preprocess.py --src_dir [path of the raw data] --tgt_dir [output path] --split [train/val/test] --cand_num [number of candidate summaries]
```
- src_dir: the candidate folder 'candidates/{args.generator_name}_{args.num_cands}'.

The preprocessing procedure stores the processed data as separate json files in tgt_dir.
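The exact schema of those json files is defined by preprocess.py; as a rough sketch, a plausible per-sample layout (the field names and directory structure below are assumptions for illustration, not the repo's actual schema) would be:

```python
import json
import os

# Hypothetical layout: one json file per sample under tgt_dir/<split>/,
# holding the source document, the reference summary, and the generated
# candidate summaries. Field names here are illustrative only.
sample = {
    "abstract": "We study transmission dynamics of a novel coronavirus ...",
    "title": "Transmission dynamics of a novel coronavirus",
    "candidates": [
        "Transmission dynamics of a coronavirus",
        "A study of coronavirus transmission",
    ],
}

tgt_dir = "processed"
os.makedirs(os.path.join(tgt_dir, "train"), exist_ok=True)
path = os.path.join(tgt_dir, "train", "0.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Each sample can then be loaded independently by the dataloader.
with open(path) as f:
    loaded = json.load(f)
print(len(loaded["candidates"]))  # prints 2
```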
You may specify the hyper-parameters in main.py.
```
python main.py --cuda --gpuid [list of gpuid] -l
python main.py --cuda --gpuid [list of gpuid] -l --model_pt [model path]
```

[model path] should be a subdirectory of the ./cache directory, e.g. cnndm/model.pt (it shouldn't contain the prefix ./cache/).
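The scorer is trained to rank candidates in the same order as their ROUGE against the reference, following SimCLS. As a rough illustration, the pairwise margin ranking loss from the SimCLS paper can be sketched in plain Python as follows (the margin value is an assumption; the real training in main.py operates on model scores, not floats):

```python
def ranking_loss(scores, margin=0.01):
    """Pairwise margin ranking loss over candidate scores.

    `scores` must be ordered by descending reference-metric quality
    (e.g. ROUGE against the gold title): a better candidate should be
    scored above every candidate ranked below it, by a margin that
    grows with the rank gap, as in SimCLS.
    """
    loss = 0.0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            loss += max(0.0, scores[j] - scores[i] + (j - i) * margin)
    return loss


# Correctly ordered scores incur no loss ...
print(ranking_loss([0.9, 0.5, 0.1]))  # prints 0.0
# ... while an inverted pair is penalized.
print(round(ranking_loss([0.1, 0.5]), 2))  # prints 0.41
```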
To evaluate with a trained scorer, run:

```
python evaluate_model.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --scorer_path cache/22-12-17-0/scorer.bin --dataset_percent 10
```
| | Before SimCLS | RoBERTa-8 |
| --- | --- | --- |
| Rouge1 | 0.2288 | 0.2269 |
| Rouge2 | 0.0764 | 0.0735 |
| RougeL | 0.1864 | 0.1823 |
| | Before SimCLS | RoBERTa-8 | RoBERTa-16 | BERT-16 | ELECTRA-16 | xlm-roberta-16 | distilbert-16 | albert-16 | deberta-16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rouge1 | 0.4266 | 0.4306 | 0.4145 | 0.4258 | 0.4173 | 0.4145 | 0.4269 | 0.4236 | 0.4171 |
| Rouge2 | 0.2222 | 0.2077 | 0.2009 | 0.2099 | 0.2059 | 0.2009 | 0.2061 | 0.2067 | 0.1976 |
| RougeL | 0.3659 | 0.3598 | 0.3454 | 0.3531 | 0.3486 | 0.3454 | 0.3543 | 0.3542 | 0.3465 |
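At inference time, SimCLS-style evaluation simply returns the candidate the scorer ranks highest and measures its ROUGE against the reference. The reranking step behind the scorer columns above amounts to (a minimal sketch; candidate titles and scores below are hypothetical):

```python
def rerank(candidates, scores):
    """Return the candidate with the highest scorer score."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy example: three candidate titles with hypothetical scorer scores.
cands = ["title A", "title B", "title C"]
scores = [0.12, 0.87, 0.43]
print(rerank(cands, scores))  # prints "title B"
```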