Code of "Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding" 2024

CyberAgentAILab/diverse-mbr

Diverse Minimum Bayes Risk Decoding

This repository contains the code for the experiments in Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding.

The code was tested on Ubuntu 20.04 with Python 3.8 and CUDA 11.0 (Docker image nvidia/cuda:11.0.3-cudnn8-devel-ubuntu20.04). It is provided largely as is, with little refactoring.

Installation

git clone git@github.com:CyberAgentAILab/diverse-mbr
cd diverse-mbr
pip install -r requirements.txt

Usage

The code runs in two steps.

  1. sample.sh samples candidates.
  2. run_mbr.sh computes the MBR output from the sampled candidates.
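
To illustrate what the second step does conceptually, here is a minimal sketch of plain MBR selection over a set of sampled candidates. It is not the code in this repository: the function names `unigram_f1` and `mbr_select` are hypothetical, and the toy unigram-overlap utility stands in for whatever utility function the experiments actually use.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): unigram F1 between whitespace-tokenized strings."""
    hyp_tokens, ref_tokens = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    samples: List[str],
    utility: Callable[[str, str], float] = unigram_f1,
) -> str:
    """Return the sample with the highest average utility against all samples."""

    def expected_utility(hyp: str) -> float:
        # The samples double as pseudo-references (Monte Carlo estimate).
        return sum(utility(hyp, ref) for ref in samples) / len(samples)

    return max(samples, key=expected_utility)


samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
print(mbr_select(samples))
```

The selected candidate is the one that agrees most with the other samples on average, which is why the sampling step has to run first.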

Sampling candidates

./experiments/sample.sh -d [DATASET] -s [NUMBER OF SAMPLES] 

Computing Diverse MBR and KMBR

./experiments/run_mbr.sh -d [DATASET] -s [NUMBER OF SAMPLES] -a [ALGORITHM]
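
As a rough intuition for selecting several outputs that are both good and different from each other, the sketch below greedily picks k candidates, scoring each by its expected utility minus a penalty for similarity to the candidates already picked. This is a generic illustration under stated assumptions, not the repository's algorithm: `jaccard`, `greedy_diverse_select`, and the `diversity_weight` parameter are hypothetical names, and the objective and optimizer behind `-a [ALGORITHM]` may differ.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Toy similarity (an assumption): Jaccard overlap of whitespace token sets."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def greedy_diverse_select(
    samples: List[str], k: int, diversity_weight: float = 1.0
) -> List[str]:
    """Greedily pick up to k samples, trading expected utility against redundancy."""
    n = len(samples)
    # Expected utility of each sample: average similarity to all samples.
    quality = [sum(jaccard(h, r) for r in samples) / n for h in samples]
    selected: List[int] = []
    while len(selected) < min(k, n):

        def gain(i: int) -> float:
            # Penalize candidates that resemble what has already been picked.
            penalty = sum(jaccard(samples[i], samples[j]) for j in selected)
            return quality[i] - diversity_weight * penalty

        best = max((i for i in range(n) if i not in selected), key=gain)
        selected.append(best)
    return [samples[i] for i in selected]


samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
picks = greedy_diverse_select(samples, k=2)
print(picks)
```

Setting `diversity_weight` to 0 reduces the sketch to taking the top-k candidates by expected utility; larger values push the selection toward dissimilar candidates.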

Example on WMT'19 En-De

  1. Use sacrebleu to prepare the benchmark dataset.
     mkdir -p ./dataset/wmt19-text
     sacrebleu -t wmt19 -l en-de --echo src > ./dataset/wmt19-text/wmt19.en-de.en
     sacrebleu -t wmt19 -l en-de --echo ref > ./dataset/wmt19-text/wmt19.en-de.de
  2. Sample candidates on WMT'19 En-De.
     ./experiments/sample.sh -d wmt19.en-de
  3. Compute Diverse MBR and K-Medoid MBR on WMT'19 En-De.
     ./experiments/run_mbr.sh -d wmt19.en-de -m wmt19-en-de -a diverse
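
Before sampling, it can help to confirm that the source and reference files written in the first step are line-aligned. The following is a small optional sketch, not part of the repository: the helper names `count_lines` and `check_parallel` are hypothetical, and the paths are the ones used in the commands above.

```python
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count the lines in a UTF-8 text file."""
    with path.open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_parallel(src: Path, ref: Path) -> int:
    """Return the shared line count, or raise if the two files differ in length."""
    n_src, n_ref = count_lines(src), count_lines(ref)
    if n_src != n_ref:
        raise ValueError(f"{src} has {n_src} lines but {ref} has {n_ref}")
    return n_src


base = Path("./dataset/wmt19-text")
src, ref = base / "wmt19.en-de.en", base / "wmt19.en-de.de"
# Only run the check if the dataset has already been prepared.
if src.exists() and ref.exists():
    print(f"{check_parallel(src, ref)} aligned sentence pairs")
```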

Reference

Yuu Jinnai, Ukyo Honda, Tetsuro Morimura, and Peinan Zhang. 2024. Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding. In Findings of the Association for Computational Linguistics ACL 2024, pages 8494–8525, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.

Bibtex:

@inproceedings{jinnai-etal-2024-generating,
    title = "Generating Diverse and High-Quality Texts by Minimum {B}ayes Risk Decoding",
    author = "Jinnai, Yuu  and
      Honda, Ukyo  and
      Morimura, Tetsuro  and
      Zhang, Peinan",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.503",
    pages = "8494--8525",
}

Contact

For any questions, feel free to raise an issue or contact me at [email protected].

Acknowledgements

The MS COCO dataset is licensed under the Creative Commons BY 4.0 license.
