guda

Source code for the paper "Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection" (EMNLP 2021).

Installation

    pip install -r requirements.txt
    cd my_fairseq && pip install --editable ./

Prepare data

Please see experiments/prepare_data.sh

GUDA experiments

Train NMT on source task

    dataset=wmt20_en_de
    src=en
    tgt=de
    ./source_task/train_nmt_src.sh $dataset $src $tgt

Data selection

Contrastive-based data selection

Prepare training data

    # subsample the source data
    ./data_selection/contrastive/subsample_general.sh

    # binarize the data using the mBERT WordPiece tokenizer
    ./data_selection/binary_data.sh $tgt $dataset

    # compute the average-pooled representation of each sentence
    ./data_selection/run_compute_vector.sh $tgt $dataset

    # cluster the average-pooled representations
    ./data_selection/run_kmeans.sh $tgt $dataset

    # train the adaptive layer
    ./data_selection/train_constrastive.sh $tgt $k

    # prepare data to train the domain classifier
    ./data_selection/prepare_domain_disc_data.sh $tgt $dataset

    # train the domain classifier
    ./data_selection/train_domain_disc_contrastive.sh

    # selection
    ## encode the monolingual data
    ## Note: the monolingual data can be split into multiple shards, e.g.
    ## split -l 1000000 -d mono.$tgt mono.10m.shards.$tgt
    ./data_selection/encode_mono_wordpiece.sh $tgt
    # score
    ./data_selection/score_en.sh $tgt $domain $k
    # merge the sharded score files
    python merge_sorted_file.py --input-dir path/to/score --output-file output --file-pattern score.*

    # keep the top 500000 sentences
    head -n 500000 output > selected_cons.en
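
The merge-and-truncate step at the end of the pipeline can be illustrated with standard tools. This is only a minimal sketch, not the behaviour of `merge_sorted_file.py`: it assumes (unconfirmed by this repository) that each shard's score file holds tab-separated `score<TAB>sentence` lines already sorted with the highest score first. The helper name `merge_top_n` and the demo file names are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: merge pre-sorted score shards and keep the top N sentences.
# Assumed line format (not confirmed by the repository): "<score>\t<sentence>",
# with every shard already sorted by score, highest first.
merge_top_n() {
    local n=$1
    shift
    # -m merges inputs that are already sorted instead of re-sorting them;
    # -k1,1gr compares the first field numerically, in descending order.
    LC_ALL=C sort -m -t "$(printf '\t')" -k1,1gr "$@" \
        | head -n "$n" \
        | cut -f2-
}

# Tiny demonstration on two made-up shards.
merge_tmp=$(mktemp -d)
printf '0.9\tsentence a\n0.4\tsentence c\n' > "$merge_tmp/score.00"
printf '0.7\tsentence b\n0.1\tsentence d\n' > "$merge_tmp/score.01"
merge_top_n 3 "$merge_tmp"/score.* > "$merge_tmp/selected"
cat "$merge_tmp/selected"
```

Because the shards are merged rather than re-sorted, the cost stays linear in the total number of lines, which matters when the monolingual pool is split into many million-line shards.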

Cross entropy difference

Train generic language model

    dataset=lm-encs
    suffix=news
    ./data_selection/ced/train_lm.sh $dataset $suffix

Train in-domain language model

    dataset=
    suffix=
    ./data_selection/ced/train_lm_indomain.sh $dataset $suffix

Calculate CED and ranking

    src=en
    tgt=de
    domain=law
    ./data_selection/ced/score.sh $tgt
    ./data_selection/ced/score-and-ced.sh $tgt $domain
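
For orientation, the standard Moore-Lewis formulation of cross-entropy difference scores a sentence s as H_in(s) - H_gen(s), the per-sentence cross-entropy under the in-domain language model minus that under the generic one; lower values indicate more in-domain text. The sketch below shows only that arithmetic and the ranking. The input files, their one-number-per-line format, and the helper name `rank_by_ced` are illustrative assumptions, not the output format of the scripts above, and sentences are assumed to contain no tab characters.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: rank sentences by cross-entropy difference (CED).
# Assumed inputs (names and formats are illustrative, not the repository's):
#   $1  per-sentence cross-entropy under the in-domain LM, one number per line
#   $2  per-sentence cross-entropy under the generic LM, one number per line
#   $3  the sentences themselves, line-aligned with $1 and $2 (no tabs)
rank_by_ced() {
    # paste aligns the three files column-wise; awk emits "CED<TAB>sentence";
    # sort -g orders ascending, so the most in-domain sentences come first.
    paste "$1" "$2" "$3" \
        | awk -F '\t' 'BEGIN { OFS = "\t" } { print $1 - $2, $3 }' \
        | LC_ALL=C sort -t "$(printf '\t')" -k1,1g
}

# Tiny demonstration with made-up numbers.
ced_tmp=$(mktemp -d)
printf '2.0\n4.0\n3.0\n' > "$ced_tmp/in"
printf '3.5\n3.0\n3.0\n' > "$ced_tmp/gen"
printf 'law text\nnews text\nmixed text\n' > "$ced_tmp/sents"
rank_by_ced "$ced_tmp/in" "$ced_tmp/gen" "$ced_tmp/sents"
```

Taking the first N lines of this ranking (for example with `head -n`) then yields a CED-selected subset comparable in size to the contrastive selection.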

UDA

    ## Prepare backtranslation data
    ./data_selection/prepare/backtranslate_all_mono.sh
    ./data_selection/do_backtranslate.sh $source_dataset_index $domain $gen_subset
    ## Binarize data
    ./data_selection/prepare/binarize_uda_data.sh $tgt $sel_type

    ## Run UDA and evaluation
    ./target_task/run_all_dgx.sh

Citing

Please cite the following paper if you find the resources in this repository useful.

@inproceedings{vu-etal-2021-generalised,
    title = "Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection",
    author = "Vu, Thuy-Trang  and
      He, Xuanli  and
      Phung, Dinh  and
      Haffari, Gholamreza",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.268",
    doi = "10.18653/v1/2021.emnlp-main.268",
    pages = "3335--3346"
}
