This is the repository for the Final Year Project "Hierarchical Document Representation for Summarization", submitted to Nanyang Technological University (https://hdl.handle.net/10356/157571). Much of the code is adapted from the GitHub repository by NLPYang for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders". Some code is borrowed from OpenNMT-py (https://github.com/OpenNMT/OpenNMT-py).
Please refer to the referenced repository to download and preprocess the required dataset. Note that the raw dataset must be preprocessed differently for ALBERT, because its vocab file (vocab size 30,000) differs from that of BERT or DistilBERT (vocab size 30,522); this repo adds code to handle that case. The only difference is in step 5:
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log -pretrained_model albert
`JSON_PATH` is the directory containing the json files (`../json_data`), `BERT_DATA_PATH` is the target directory to save the generated binary files (`../bert_data` or `../albert_data`), and `-pretrained_model` takes `bert`, `albert`, or `distilbert`; the `albert` option triggers the different preprocessing in this repo.
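For intuition only, here is a minimal sketch (not this repo's actual preprocessing code) of how the `-pretrained_model` flag could map to a tokenizer, assuming the Hugging Face `transformers` checkpoints `bert-base-uncased`, `albert-base-v2`, and `distilbert-base-uncased`; the helper name is hypothetical.

```python
# Minimal sketch, NOT the repo's actual preprocessing code: how -pretrained_model
# might be mapped to a tokenizer. ALBERT uses a SentencePiece vocab of 30,000
# tokens, while BERT/DistilBERT use a WordPiece vocab of 30,522 tokens.
from transformers import AlbertTokenizer, BertTokenizer, DistilBertTokenizer

def load_tokenizer(pretrained_model: str):
    """Hypothetical helper for illustration only."""
    if pretrained_model == "albert":
        return AlbertTokenizer.from_pretrained("albert-base-v2")
    if pretrained_model == "distilbert":
        return DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
    return BertTokenizer.from_pretrained("bert-base-uncased")

print(load_tokenizer("albert").vocab_size)  # 30000
print(load_tokenizer("bert").vocab_size)    # 30522
```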
Please refer to requirements.txt. Important packages:
- torch==1.1.0
- transformers==4.16.2
- sentencepiece==0.1.96
- pyrouge
- tensorboardX==1.9
- multiprocess==0.70.9
- pytorch-transformers==1.2.0
- nltk==3.7
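Assuming a standard Python environment, the dependencies listed above can be installed from the requirements file:

pip install -r requirements.txt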
python train.py -task ext -mode train -bert_data_path ALBERT_DATA_PATH -ext_dropout 0.1 -model_path INTENDED_CHECKPOINT_SAVED_DIR -lr 2e-3 -visible_gpus 0,1,2,3 -report_every 100 -save_checkpoint_steps 1000 -train_steps 10000 -accum_count 2 -log_file ../logs/ext_bert_cnndm -use_interval true -warmup_steps 2000 -max_pos 512 -other_bert albert -batch_size 300 -doc_weight 0.4 -extra_attention False -sharing True
Arguments:
- -task ext (fixed to extractive summarization only for HIWESTSUM)
- -mode train (train/validate/test)
- -bert_data_path ALBERT_DATA_PATH (IMPORTANT, change ALBERT_DATA_PATH to your preprocessed data directory, eg: ../albert_data/albert_data)
- -ext_dropout 0.1 (dropout rate)
- -model_path INTENDED_CHECKPOINT_SAVED_DIR (IMPORTANT, change INTENDED_CHECKPOINT_SAVED_DIR to the intended output directory for model checkpoints, eg: ./hiwest/albert0.4)
- -lr 2e-3 (learning rate; the suggested learning rates for BERT are 5e-5, 3e-5 and 2e-5)
- -visible_gpus 0,1,2,3 (IMPORTANT, set the GPU(s) to be used)
- -save_checkpoint_steps 1000 (save a checkpoint every X steps, where X is the argument)
- -report_every 100 (log training progress every X steps, where X is the argument)
- -train_steps 10000 (set total training steps)
- -log_file ../logs/ext_bert_cnndm (log file location)
- -max_pos 512 (set the max position; to encode text longer than 512 tokens, e.g. 800 tokens, set max_pos to 800 during both preprocessing and training)
- -other_bert albert (IMPORTANT, set the pretrained model to be used, albert/bert/distilbert)
- -batch_size 300 (set batch size)
- -doc_weight 0.4 (IMPORTANT, set the document weight; document_weight + sent_weight = 1, see the sketch after this list)
- -sharing True (IMPORTANT, determine whether weights are shared between the BERT layer and the extractive layers)
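The exact scoring code lives in this repo, but as a hedged illustration of what `doc_weight + sent_weight = 1` implies, the final sentence scores can be thought of as a convex combination of document-level and sentence-level scores. The function name and tensor shapes below are hypothetical.

```python
# Illustrative sketch only: how -doc_weight could blend document-level and
# sentence-level sentence scores, given doc_weight + sent_weight = 1.
import torch

def combine_scores(sent_scores: torch.Tensor,
                   doc_scores: torch.Tensor,
                   doc_weight: float = 0.4) -> torch.Tensor:
    # sent_scores / doc_scores: [batch_size, n_sentences] extraction scores
    # derived from sentence-level and document-level representations.
    sent_weight = 1.0 - doc_weight
    return doc_weight * doc_scores + sent_weight * sent_scores

scores = combine_scores(torch.rand(2, 5), torch.rand(2, 5), doc_weight=0.4)
print(scores.shape)  # torch.Size([2, 5])
```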
python train.py -task ext -mode validate -batch_size 300 -test_batch_size 500 -bert_data_path ../albert_data/albert_data -log_file /home/students/s121md102_06/bertsum_experiment/PreSummWithMobileBert/logs/val_hiwest_distilbert_cnndm -model_path ./hiwest/albert0.4 -sep_optim true -use_interval true -visible_gpus 0,1,2,3 -max_pos 512 -max_length 200 -alpha 0.95 -min_length 50 -result_path ../logs/hiwestsum_al0.4 -other_bert albert -architecture hiwest -doc_weight 0.4 -extra_attention False -sharing True
Most extractive summarization models employ a hierarchical encoder for document summarization. However, these extractive models rely solely on document-level information to classify and select sentences, which may not be the most effective approach. In addition, most state-of-the-art (SOTA) models use a huge number of parameters to learn from large amounts of data, which makes them computationally expensive.
In this project, Hierarchical Weight Sharing Transformers for Summarization (HIWESTSUM) is proposed for document summarization. HIWESTSUM is lightweight, with a parameter size over 10 times smaller than existing models that fine-tune BERT for summarization, and it is faster than SOTA models in both training and inference. It learns effectively from both sentence- and document-level representations through weight sharing mechanisms.
By adopting weight sharing and hierarchical learning strategies, this project shows that HIWESTSUM can reduce the computational resources required for summarization and achieve results comparable to SOTA models when trained on smaller datasets.
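As a conceptual sketch only (this is not the repo's actual model code, and `SharedHierarchicalEncoder` is a hypothetical name), weight sharing across hierarchy levels can be pictured as reusing a single Transformer layer for both token-within-sentence and sentence-within-document encoding, which is what keeps the parameter count small. The snippet assumes a recent PyTorch that provides `nn.TransformerEncoderLayer`.

```python
# Conceptual sketch only: the same layer weights serve both hierarchy levels.
import torch
import torch.nn as nn

class SharedHierarchicalEncoder(nn.Module):
    def __init__(self, d_model: int = 768, nhead: int = 8):
        super().__init__()
        # One set of Transformer weights is reused at both levels.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: [n_sentences, n_tokens, d_model] for one document.
        # Sentence level: encode tokens within each sentence.
        sent_encoded = self.shared_layer(
            token_embeddings.transpose(0, 1)).transpose(0, 1)
        # Document level: reuse the SAME layer over sentence vectors
        # (here simply the mean of the token states of each sentence).
        sent_vectors = sent_encoded.mean(dim=1).unsqueeze(1)  # [n_sentences, 1, d_model]
        doc_encoded = self.shared_layer(sent_vectors)
        return doc_encoded.squeeze(1)  # [n_sentences, d_model]

encoder = SharedHierarchicalEncoder()
out = encoder(torch.rand(6, 30, 768))  # 6 sentences, 30 tokens each
print(out.shape)  # torch.Size([6, 768])
```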