This repository contains the data, code, and model checkpoints for our paper "ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization".
- 2022.02.23: We released our model checkpoints and model outputs.
- 2022.02.22: We released the ClidSum benchmark dataset.
- 2022.02.14: We released our paper. Check it out!
- Overview
- ClidSum Benchmark Dataset
- Model List
- Use mDialBART with Huggingface
- Finetune mDialBART
- Citation and Contact
In this work, we introduce the cross-lingual dialogue summarization task and present the ClidSum benchmark dataset together with the mDialBART pre-trained language model. ClidSum contains three subsets: XSAMSum, XMediaSum40k, and MediaSum424k. mDialBART extends mBART-50 via a second pre-training stage. The following figure illustrates mDialBART.
Please restrict your usage of this dataset to research purposes only.
You can obtain ClidSum from the following share links: XSAMSum, XMediaSum40k. For MediaSum424k, please refer to the MediaSum repository, since MediaSum424k is a subset of MediaSum. The following table shows the statistics of ClidSum.
The format of the obtained JSON files is as follows:
[
{
"dialogue": "Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check\nHannah: <file_gif>\nAmanda: Sorry, can't find it.\nAmanda: Ask Larry\nAmanda: He called her last time we were at the park together\nHannah: I don't know him well\nHannah: <file_gif>\nAmanda: Don't be shy, he's very nice\nHannah: If you say so..\nHannah: I'd rather you texted him\nAmanda: Just text him 🙂\nHannah: Urgh.. Alright\nHannah: Bye\nAmanda: Bye bye",
"summary": "Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.",
"summary_de": "hannah braucht bettys nummer, aber amanda hat sie nicht. sie muss larry kontaktieren.",
"summary_zh": "汉娜需要贝蒂的电话号码,但阿曼达没有。她得联系拉里。"
},
{
"dialogue": "Eric: MACHINE!\r\nRob: That's so gr8!\r\nEric: I know! And shows how Americans see Russian ;)\r\nRob: And it's really funny!\r\nEric: I know! I especially like the train part!\r\nRob: Hahaha! No one talks to the machine like that!\r\nEric: Is this his only stand-up?\r\nRob: Idk. I'll check.\r\nEric: Sure.\r\nRob: Turns out no! There are some of his stand-ups on youtube.\r\nEric: Gr8! I'll watch them now!\r\nRob: Me too!\r\nEric: MACHINE!\r\nRob: MACHINE!\r\nEric: TTYL?\r\nRob: Sure :)",
"summary": "Eric and Rob are going to watch a stand-up on youtube.",
"summary_de": "eric und rob werden sich ein stand-up auf youtube ansehen.",
"summary_zh": "埃里克和罗伯要在youtube上看一场单口相声。"
},
...
]
summary represents the original English summary of the corresponding dialogue. summary_de and summary_zh indicate the human-translated German and Chinese summaries, respectively.
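As a quick sanity check, here is a minimal loading sketch in Python; the filename XSAMSum.json is our assumption, so substitute the actual file you downloaded:

import json

# Hypothetical filename; use the file obtained from the share links above.
with open('XSAMSum.json', encoding='utf-8') as f:
    samples = json.load(f)  # a list of dicts in the format shown above

for sample in samples[:2]:
    print(sample['dialogue'])    # the source dialogue
    print(sample['summary'])     # original English summary
    print(sample['summary_de'])  # human-translated German summary
    print(sample['summary_zh'])  # human-translated Chinese summary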
In addition, as described in our paper, XMediaSum40k is constructed from 40k samples randomly selected from the MediaSum corpus (463k samples in total). To see which samples were selected, you can find the original ID of each sample in the train_val_test_split.40k.json file.
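The sketch below shows one way to inspect the split file; we assume it maps each split name to a list of original MediaSum IDs, so please verify the keys against the actual file:

import json

with open('train_val_test_split.40k.json', encoding='utf-8') as f:
    split = json.load(f)

# Assumed structure: {'train': [...], 'val': [...], 'test': [...]},
# where each list holds the original MediaSum IDs of the selected samples.
for name, ids in split.items():
    print(name, len(ids), ids[:3])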
License: CC BY-NC-SA 4.0
Our released models are listed below. You can import these models using HuggingFace's Transformers.
| Model | Checkpoint |
|---|---|
| mDialBART (En-De) | Krystalan/mdialbart_de |
| mDialBART (En-Zh) | Krystalan/mdialbart_zh |
You can easily import our models with HuggingFace's transformers:
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# The mDialBART tokenizer is based on mBART-50's tokenizer; we only add the special token
# [summarize] to indicate the summarization task during the second pre-training stage.
tokenizer = MBart50TokenizerFast.from_pretrained('facebook/mbart-large-50-many-to-many-mmt', src_lang='en_XX', tgt_lang='de_DE')
tokenizer.add_tokens(['[summarize]'])

# Load our model; the checkpoint is downloaded automatically.
model = MBartForConditionalGeneration.from_pretrained('Krystalan/mdialbart_de')
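As a usage sketch, the snippet below generates a German summary. Prefixing the dialogue with the [summarize] token and the decoding hyperparameters are our assumptions rather than settings fixed by the checkpoint:

# A minimal generation sketch (assumption: the input dialogue is prefixed
# with the [summarize] token, mirroring the pre-training setup).
dialogue = "Hannah: Hey, do you have Betty's number?\nAmanda: Lemme check"
inputs = tokenizer('[summarize] ' + dialogue, return_tensors='pt', truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id['de_DE'],  # decode into German
    num_beams=5,                                             # assumed beam size
    max_length=128,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])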
In the following sections, we describe how to fine-tune mDialBART using our code.
Please run the following script to install the dependencies:
pip install -r requirements.txt
[TODO]
The output summaries are available in the outputs directory:
- mdialbart_de.txt: the output German summaries from fine-tuning mDialBART on XMediaSum40k.
- mdialbart_zh.txt: the output Chinese summaries from fine-tuning mDialBART on XMediaSum40k.
- mdialbart_de_da.txt: the output German summaries from fine-tuning mDialBART on XMediaSum40k with the help of pseudo samples (DA).
- mdialbart_zh_da.txt: the output Chinese summaries from fine-tuning mDialBART on XMediaSum40k with the help of pseudo samples (DA).
For ROUGE scores, we utilize the Multilingual ROUGE Scoring toolkit. The evaluation command is as follows (use --lang="zhongwen" for Chinese):
python rouge.py \
    --rouge_types=rouge1,rouge2,rougeL \
    --target_filepattern=gold.txt \
    --prediction_filepattern=generated.txt \
    --output_filename=scores.csv \
    --lang="german" \
    --use_stemmer=true
For BERTScore, you should first download the chinese-bert-wwm-ext and bert-base-german-uncased models, and then use the bert_score toolkit. The evaluation commands are as follows:
model_path=xxx/chinese-bert-wwm-ext # For Chinese
model_path=xxx/bert-base-german-uncased # For German
bert-score -r $gold_file_path -c $generate_file_path --lang zh --model $model_path --num_layers 8 # For Chinese
bert-score -r $gold_file_path -c $generate_file_path --lang de --model $model_path --num_layers 8 # For German
If you find this work useful or use the data in your work, please consider citing our paper:
@article{Wang2022ClidSumAB,
title={ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization},
author={Jiaan Wang and Fandong Meng and Ziyao Lu and Duo Zheng and Zhixu Li and Jianfeng Qu and Jie Zhou},
year={2022},
eprint={2202.05599},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Please feel free to email Jiaan Wang (jawang1[at]stu.suda.edu.cn) for questions and suggestions.