This is the official repository which contains all the code necessary to replicate the results from the ACL 2020 long paper Hard-Coded Gaussian Attention for Neural Machine Translation. It can also be used to train a vanilla Transformer.
The full model architecture is displayed below:
Our approach simplifies the Transformer architecture for neural machine translation (NMT) by using a hard-coded Gaussian distribution instead of learned attention. We replace the multi-headed attention, which is computed from queries and keys, with a fixed Gaussian distribution that focuses on the current word or a word near it. The figure above shows how our attention differs from that of the vanilla Transformer.
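As a rough illustration of the idea (not the implementation in this repository), the sketch below builds a fixed attention matrix in which each position attends to its neighbors with Gaussian weights centered at the current position plus an offset. The function name `gaussian_attention`, the default standard deviation of 1, the row normalization, and the causal mask are all assumptions made for this example.

```python
import math


def gaussian_attention(seq_len, offset, std=1.0, causal=False):
    """Return a seq_len x seq_len matrix of fixed attention weights.

    Row i holds the weights that query position i places on every key
    position j, following a Gaussian centered at i + offset.
    The std value, normalization, and masking are illustrative assumptions.
    """
    rows = []
    for i in range(seq_len):
        center = i + offset
        row = []
        for j in range(seq_len):
            if causal and j > i:
                # A decoder position must not attend to future positions.
                row.append(0.0)
            else:
                row.append(math.exp(-0.5 * ((j - center) / std) ** 2))
        total = sum(row)
        rows.append([w / total for w in row])  # each row sums to 1
    return rows


if __name__ == "__main__":
    # Offset 0 centers on the current word, -1 on the previous word.
    for row in gaussian_attention(5, offset=0):
        print([round(w, 3) for w in row])
```

Because the weights depend only on positions, the matrix can be computed once per sequence length and contains no learned parameters.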
This code base is adapted from synst.
The code requires Python 3.7+. The Python dependencies can be installed with the following command (using a virtual environment is highly recommended):
pip install -r requirements.txt
If you want to use the scripts that wrap multi-bleu.perl and sacrebleu, then you'll need to have Moses-SMT available as well.
The code has one main entry point, main.py, with a couple of support scripts for the analysis conducted in the paper. Please use python main.py -h for additional options not listed below. You can also use python main.py <action> -h for options specific to the available actions: {train, evaluate, translate, pass}.
Preprocessing for wmt_en_de, wmt_en_fr, and iwslt_en_de is the same as in synst:
python main.py --dataset wmt_en_de -d data/raw/wmt -p data/preprocessed/wmt -v pass
We also added the IWSLT En-Ja and WMT En-Ro datasets, which you can process as described below.
Both En-Ja and En-Ro need subword-nmt for subword processing, WMT16-scripts for standard processing, and Mosesdecoder for tokenization. Clone them into directories of your choice and set the environment variables to those directories:
git clone https://github.com/rsennrich/subword-nmt.git
git clone https://github.com/rsennrich/wmt16-scripts.git
git clone https://github.com/moses-smt/mosesdecoder.git
export SUBWORD=subword-nmt
export WMT16_SCRIPTS=wmt16-scripts
export MOSES=mosesdecoder
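Before running the processing scripts, it can help to confirm that the three variables are set and point to real directories. The snippet below is a minimal sketch of such a check; the helper name `missing_tool_dirs` is hypothetical and not part of this repository.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("SUBWORD", "WMT16_SCRIPTS", "MOSES")


def missing_tool_dirs(env=None):
    """Return the required variables that are unset or do not name a directory."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not os.path.isdir(env.get(name, ""))]


if __name__ == "__main__":
    missing = missing_tool_dirs()
    if missing:
        print("Unset or invalid:", ", ".join(missing))
    else:
        print("All tool directories found.")
```

Note that relative paths such as subword-nmt are resolved against the current working directory, so run the check from the same directory you will run the processing scripts from.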
Then download and process the data for each dataset with the following commands.
En-Ja:
mkdir -p data/raw/enja
mkdir -p data/processed/enja
RAW_PATH=data/raw/enja PROCESS_PATH=data/processed/enja bash process_new_data/process_enja.sh
En-Ro:
mkdir -p data/raw/enro
mkdir -p data/processed/enro
RAW_PATH=data/raw/enro PROCESS_PATH=data/processed/enro bash process_new_data/process_enro.sh
Please note that our preprocessing of En-Ja does not follow the standard procedure for the IWSLT En-Ja dataset, so the results might differ from those reported in other work.
Assuming you have access to eight 1080Ti GPUs, you can recreate the results for the hard-coded self-attention model on the WMT'14 En-De dataset with:
python main.py -b 3175 --dataset wmt_en_de \
--model new_transformer \
--enc-attn-type normal --enc-attn-offset -1 1 \
--dec-attn-type normal --dec-attn-offset -1 0 \
-d data/raw/wmt -p data/preprocessed/wmt -v train \
--checkpoint-interval 1200 --accumulate 2 \
--checkpoint-directory experiments/wmt_en_de_01
The above command will train on 8 GPUs with approximately 3175 source/target tokens combined per GPU and accumulate the gradients over two batches before updating the model parameters (leading to ~50.8k tokens per model update).
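The tokens-per-update figure follows directly from the batch size, the number of GPUs, and the accumulation count. The helper below, whose name is hypothetical, makes the arithmetic explicit so you can adjust it for a different hardware setup:

```python
def tokens_per_update(tokens_per_gpu, num_gpus, accumulate):
    """Approximate number of tokens contributing to one parameter update."""
    return tokens_per_gpu * num_gpus * accumulate


# Values from the command above: -b 3175, 8 GPUs, --accumulate 2
print(tokens_per_update(3175, 8, 2))  # 50800
```

For example, with 4 GPUs you could keep roughly the same effective batch by setting the accumulation count to 4.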
You can also recreate the results for the model with all attention hard-coded on the IWSLT'16 En-De dataset, using a single 1080Ti GPU, with:
python main.py -b 6000 --dataset iwslt_en_de \
--model new_transformer \
--enc-attn-type normal --enc-attn-offset -1 1 \
--dec-attn-type normal --dec-attn-offset -1 0 \
--enc-dec-attn-type normal --enc-dec-attn-offset -1 1 \
--embedding-size 288 --hidden-dim 507 --num-heads 4 --num-layers 5 \
-d data/raw/wmt -p data/preprocessed/wmt -v train \
--checkpoint-interval 600 --accumulate 1 \
--checkpoint-directory experiments/iwslt_en_de_01 \
--label-smoothing 0.0 --learning-rate-scheduler linear --learning-rate 3e-4
To recreate the results for the model with a single learned cross-attention head, you can run:
python main.py -b 6000 --dataset iwslt_en_de \
--model new_transformer \
--enc-attn-type normal --enc-attn-offset -1 1 \
--dec-attn-type normal --dec-attn-offset -1 0 \
--enc-dec-attn-type learned \
--enc-dec-attn-layer 0 0 0 0 1 --enc-dec-attn-num-heads 0 0 0 0 1 \
--embedding-size 288 --hidden-dim 507 --num-heads 4 --num-layers 5 \
-d data/raw/wmt -p data/preprocessed/wmt -v train \
--checkpoint-interval 600 --accumulate 1 \
--checkpoint-directory experiments/iwslt_en_de_01 \
--label-smoothing 0.0 --learning-rate-scheduler linear --learning-rate 3e-4
The default model is the Transformer model. If you want to train a vanilla Transformer model on the WMT'14 De-En dataset, you can run this:
python main.py -b 3175 --dataset wmt_de_en \
-d data/raw/wmt -p data/preprocessed/wmt -v train \
--checkpoint-interval 1200 --accumulate 2
You can run a separate process to evaluate each new checkpoint generated during training (you may want to either run it on a GPU not used for training or disable CUDA, as done below):
python main.py -b 5000 --dataset wmt_en_de \
--model new_transformer -d data/raw/wmt -p data/preprocessed/wmt \
--enc-attn-type normal --enc-attn-offset -1 1 \
--dec-attn-type normal --dec-attn-offset -1 0 \
--split valid --disable-cuda -v evaluate \
--watch-directory /tmp/stupidnmt/checkpoints
After training a model, you can generate translations with the following command (currently only translation on a single GPU is supported):
CUDA_VISIBLE_DEVICES=0 python main.py --dataset wmt_en_de \
--model new_transformer \
--enc-attn-type normal --enc-attn-offset -1 1 \
--dec-attn-type normal --dec-attn-offset -1 0 \
-d data/raw/wmt -p data/preprocessed/wmt \
--batch-size 1 --batch-method example --split test \
--restore /tmp/stupidnmt/checkpoints/checkpoint.pt \
--average-checkpoints 5 translate \
--max-decode-length 50 --length-basis input_lens --order-output
By default, this will output translations to /tmp/stupidnmt/output.
If you have a comet.ml account, you can track experiments by prefixing the script call with:
env $(cat ~/.comet.ml | xargs) python main.py --track ...
Where ~/.comet.ml is the file which contains your API key for logging experiments on the service. By default, this will track experiments in a workspace named umass-nlp with project name probe-transformer. See args.py in order to configure the experiment tracking to suit your needs.
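The env $(cat ~/.comet.ml | xargs) prefix works because the file holds KEY=VALUE entries that env turns into environment variables for the script. The sketch below mimics that parsing in Python for simple one-entry-per-line files. The function name `parse_env_file` is hypothetical, and the variable name COMET_API_KEY is an assumption about what your file contains.

```python
def parse_env_file(text):
    """Parse KEY=VALUE lines into a dict, skipping blank or malformed lines."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


# Hypothetical file contents; replace the value with your own key.
example = "COMET_API_KEY=your-api-key\n"
print(parse_env_file(example))  # {'COMET_API_KEY': 'your-api-key'}
```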
@inproceedings{you-etal-2020-hard,
title = "Hard-Coded {G}aussian Attention for Neural Machine Translation",
author = "You, Weiqiu and
Sun, Simeng and
Iyyer, Mohit",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.687",
pages = "7689--7700",
abstract = "Recent work has questioned the importance of the Transformer{'}s multi-headed attention for achieving high translation quality. We push further in this direction by developing a {``}hard-coded{''} attention variant without any learned parameters. Surprisingly, replacing all learned self-attention heads in the encoder and decoder with fixed, input-agnostic Gaussian distributions minimally impacts BLEU scores across four different language pairs. However, additionally, hard-coding cross attention (which connects the decoder to the encoder) significantly lowers BLEU, suggesting that it is more important than self-attention. Much of this BLEU drop can be recovered by adding just a single learned cross attention head to an otherwise hard-coded Transformer. Taken as a whole, our results offer insight into which components of the Transformer are actually important, which we hope will guide future work into the development of simpler and more efficient attention-based models.",
}