This is the repository for the Final Year Project "Hierarchical Document Representation for Summarization", submitted to Nanyang Technological University (https://hdl.handle.net/10356/157571). Much of the code is adapted from the GitHub repository by NLPYang for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders". Some code is borrowed from OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).
Please refer to the referenced repository to download and preprocess the required dataset. Note that the raw dataset must be preprocessed differently for ALBERT, because its vocab file (vocab size 30,000) differs from that of BERT or DistilBERT (vocab size 30,522); this repo adds code to handle that case. The only difference is in step 5:
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log -pretrained_model albert
`JSON_PATH` is the directory containing the json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data` or `../albert_data`), and `-pretrained_model` takes `bert`, `albert`, or `distilbert`; the `albert` option triggers the different preprocessing in this repo.
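For intuition only, here is a minimal sketch (not this repo's actual preprocessing code) of how the `-pretrained_model` flag could map to a tokenizer, assuming the Hugging Face `transformers` checkpoints `bert-base-uncased`, `albert-base-v2`, and `distilbert-base-uncased`; the helper name is hypothetical.

```python
# Minimal sketch, NOT the repo's actual preprocessing code: how -pretrained_model
# might be mapped to a tokenizer. ALBERT uses a SentencePiece vocab of 30,000
# tokens, while BERT/DistilBERT use a WordPiece vocab of 30,522 tokens.
from transformers import AlbertTokenizer, BertTokenizer, DistilBertTokenizer

def load_tokenizer(pretrained_model: str):
    """Hypothetical helper for illustration only."""
    if pretrained_model == "albert":
        return AlbertTokenizer.from_pretrained("albert-base-v2")
    if pretrained_model == "distilbert":
        return DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    return BertTokenizer.from_pretrained("bert-base-uncased")

print(load_tokenizer("albert").vocab_size)  # 30000
print(load_tokenizer("bert").vocab_size)    # 30522
```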
Please refer to requirements.txt. Important packages:
- torch==1.1.0
- transformers==4.16.2
- sentencepiece==0.1.96
- pyrouge
- tensorboardX==1.9
- multiprocess==0.70.9
- pytorch-transformers==1.2.0
- nltk==3.7
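Assuming a standard Python environment, the dependencies listed above can be installed from the requirements file:

pip install -r requirements.txt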
python train.py -task ext -mode train -bert_data_path ALBERT_DATA_PATH -ext_dropout 0.1 -model_path INTENDED_CHECKPOINT_SAVED_DIR -lr 2e-3 -visible_gpus 0,1,2,3 -report_every 100 -save_checkpoint_steps 1000 -train_steps 10000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 2000 -max_pos 512 -other_bert albert -batch_size 300 -doc_weight 0.4 -extra_attention False -sharing True
Arguments:
- -task ext (fixed to extractive summarization only for HIWESTSUM)
- -mode train (train/validate/test)
- -bert_data_path ALBERT_DATA_PATH (IMPORTANT, change ALBERT_DATA_PATH to your preprocessed data directory, eg: ../albert_data/albert_data)
- -ext_dropout 0.1 (dropout rate)
- -model_path INTENDED_CHECKPOINT_SAVED_DIR (IMPORTANT, change INTENDED_CHECKPOINT_SAVED_DIR to the intended output directory for model checkpoints, eg: ./hiwest/albert0.4)
- -lr 2e-3 (learning rate; the suggested learning rates for BERT are 5e-5, 3e-5 and 2e-5)
- -visible_gpus 0,1,2,3 (IMPORTANT, set the GPU(s) to be used)
- -save_checkpoint_steps 1000 (save a checkpoint every X steps, where X is the argument)
- -report_every 100 (log training progress every X steps, where X is the argument)
- -train_steps 10000 (set total training steps)
- -log_file ../logs/ext_bert_cnndm (log file location)
- -max_pos 512 (set the max position; to encode text longer than 512 tokens, e.g. 800 tokens, set max_pos to 800 during both preprocessing and training)
- -other_bert albert (IMPORTANT, set the pretrained model to be used, albert/bert/distilbert)
- -batch_size 300 (set batch size)
- -doc_weight 0.4 (IMPORTANT, set the document weight; document_weight + sent_weight = 1, see the sketch after this list)
- -sharing True (IMPORTANT, determine whether weights are shared between the BERT layer and the extractive layers)
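The exact scoring code lives in this repo, but as a hedged illustration of what `doc_weight + sent_weight = 1` implies, the final sentence scores can be thought of as a convex combination of document-level and sentence-level scores. The function name and tensor shapes below are hypothetical.

```python
# Illustrative sketch only: how -doc_weight could blend document-level and
# sentence-level sentence scores, given doc_weight + sent_weight = 1.
import torch

def combine_scores(sent_scores: torch.Tensor,
                   doc_scores: torch.Tensor,
                   doc_weight: float = 0.4) -> torch.Tensor:
    # sent_scores / doc_scores: [batch_size, n_sentences] extraction scores
    # derived from sentence-level and document-level representations.
    sent_weight = 1.0 - doc_weight
    return doc_weight * doc_scores + sent_weight * sent_scores

scores = combine_scores(torch.rand(2, 5), torch.rand(2, 5), doc_weight=0.4)
print(scores.shape)  # torch.Size([2, 5])
```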
python train.py -task ext -mode validate -batch_size 300 -test_batch_size 500 -bert_data_path ../albert_data/albert_data -log_file /home/students/s121md102_06/bertsum_experiment/PreSummWithMobileBert/logs/val_hiwest_distilbert_cnndm -model_path ./hiwest/albert0.4 -sep_optim true -use_interval true -visible_gpus 0,1,2,3 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../logs/hiwestsum_al0.4 -other_bert albert -architecture hiwest -doc_weight 0.4 -extra_attention False -sharing True
Most extractive summarization models employ a hierarchical encoder for document summarization. However, these extractive models rely solely on document-level information to classify and select sentences, which may not be the most effective approach. In addition, most state-of-the-art (SOTA) models use a huge number of parameters to learn from large amounts of data, which makes them computationally expensive.
In this project, Hierarchical Weight Sharing Transformers for Summarization (HIWESTSUM) is proposed for document summarization. HIWESTSUM is lightweight, with a parameter size over 10 times smaller than existing models that fine-tune BERT for summarization, and it is faster than SOTA models in both training and inference. It learns effectively from both sentence- and document-level representations through weight sharing mechanisms.
By adopting weight sharing and hierarchical learning strategies, this project shows that HIWESTSUM can reduce the computational resources required for summarization and achieve results comparable to SOTA models when trained on smaller datasets.
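As a conceptual sketch only (this is not the repo's actual model code, and `SharedHierarchicalEncoder` is a hypothetical name), weight sharing across hierarchy levels can be pictured as reusing a single Transformer layer for both token-within-sentence and sentence-within-document encoding, which is what keeps the parameter count small. The snippet assumes a recent PyTorch that provides `nn.TransformerEncoderLayer`.

```python
# Conceptual sketch only: the same layer weights serve both hierarchy levels.
import torch
import torch.nn as nn

class SharedHierarchicalEncoder(nn.Module):
    def __init__(self, d_model: int = 768, nhead: int = 8):
        super().__init__()
        # One set of Transformer weights is reused at both levels.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: [n_sentences, n_tokens, d_model] for one document.
        # Sentence level: encode tokens within each sentence.
        sent_encoded = self.shared_layer(
            token_embeddings.transpose(0, 1)).transpose(0, 1)
        # Document level: reuse the SAME layer over sentence vectors
        # (here simply the mean of the token states of each sentence).
        sent_vectors = sent_encoded.mean(dim=1).unsqueeze(1)  # [n_sentences, 1, d_model]
        doc_encoded = self.shared_layer(sent_vectors)
        return doc_encoded.squeeze(1)  # [n_sentences, d_model]

encoder = SharedHierarchicalEncoder()
out = encoder(torch.rand(6, 30, 768))  # 6 sentences, 30 tokens each
print(out.shape)  # torch.Size([6, 768])
```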