Sockeye 2 Documentation Update (awslabs#722)
* Documentation update

* Update large data tutorial

* WMT large update
mjdenkowski authored and fhieber committed Aug 29, 2019
1 parent 26cbc97 commit acb0815
Showing 7 changed files with 184 additions and 31 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@ Each version section may have subsections for: _Added_, _Changed_, _Removed
- Update to [MXNet 1.5.0](https://github.com/apache/incubator-mxnet/tree/1.5.0)
- Moved `SockeyeModel` implementation and all layers to [Gluon API](http://mxnet.incubator.apache.org/versions/master/gluon/index.html)
- Removed support for Python 3.4.
- Removed image captioning module
- Removed outdated Autopilot module
- Removed unused training options: Eve, Nadam, RMSProp, Nag, Adagrad, and Adadelta optimizers, `fixed-step` and `fixed-rate-inv-t` learning rate schedulers
- Updated and renamed learning rate scheduler `fixed-rate-inv-sqrt-t` -> `inv-sqrt-decay`
4 changes: 3 additions & 1 deletion README.md
@@ -30,7 +30,9 @@ See the [Dockerfile documentation](sockeye_contrib/docker) for more information.
## Documentation

For information on how to use Sockeye, please visit [our documentation](https://awslabs.github.io/sockeye/).
Developers may be interested in our [developer guidelines](https://awslabs.github.io/sockeye/development.html).

- For a quickstart guide to training a large data WMT model, see the [WMT 2018 German-English tutorial](https://awslabs.github.io/sockeye/tutorials/wmt_large.html).
- Developers may be interested in our [developer guidelines](https://awslabs.github.io/sockeye/development.html).

## Citation

6 changes: 1 addition & 5 deletions docs/index.md
@@ -13,15 +13,11 @@ layout: default
This is the documentation for Sockeye, a sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet Incubating.
It implements state-of-the-art encoder-decoder architectures, such as

- Deep Recurrent Neural Networks with Attention [[Bahdanau, '14](https://arxiv.org/abs/1409.0473)]
- Transformer Models with self-attention [[Vaswani et al, '17](https://arxiv.org/abs/1706.03762)]
- Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]

In addition, this framework provides an experimental [image-to-description module](https://github.com/awslabs/sockeye/tree/master/sockeye/image_captioning) that can be used for [image captioning](image_captioning.html).

Recent developments and changes are tracked in our [CHANGELOG](https://github.com/awslabs/sockeye/blob/master/CHANGELOG.md).

If you are interested in collaborating or have any questions, please submit a pull request or [issue](https://github.com/awslabs/sockeye/issues/new).
You can also send questions to *sockeye-dev-at-amazon-dot-com*.
Developers may be interested in [our developer guidelines](development.html).

12 changes: 0 additions & 12 deletions docs/sockeye_captioning.bib

This file was deleted.

1 change: 1 addition & 0 deletions docs/tutorials.md
@@ -13,3 +13,4 @@ introduce different concepts and parameters used for training and translation.
1. [Sequence copy task](tutorials/seqcopy.html)
1. [WMT German to English news translation](tutorials/wmt.html)
1. [Domain adaptation of NMT models](tutorials/adapt.html)
1. [Large data: WMT German-English 2018](tutorials/wmt_large.html)
15 changes: 2 additions & 13 deletions docs/tutorials/wmt.md
@@ -16,7 +16,7 @@ git clone https://github.com/rsennrich/subword-nmt.git
export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH
```

We will visualize training progress using Tensorboard and its MXNet adaptor, `mxboard`.
Install it using:
```bash
pip install tensorboard mxboard
@@ -95,24 +95,13 @@ We can now kick off the training process:
python -m sockeye.train -d train_data \
-vs newstest2016.tc.BPE.de \
-vt newstest2016.tc.BPE.en \
--encoder rnn \
--decoder rnn \
--num-embed 256 \
--rnn-num-hidden 512 \
--rnn-attention-type dot \
--max-seq-len 60 \
--decode-and-evaluate 500 \
--use-cpu \
-o wmt_model
```

This will train a 1-layer bi-LSTM encoder, 1-layer LSTM decoder with dot attention.
Sockeye offers a whole variety of different options regarding the model architecture,
such as stacked RNNs with residual connections (`--num-layers`, `--rnn-residual-connections`),
[Transformer](https://arxiv.org/abs/1706.03762) encoder and decoder (`--encoder transformer`, `--decoder transformer`),
[ConvS2S](https://arxiv.org/pdf/1705.03122) (`--encoder cnn`, `--decoder cnn`),
various RNN (`--rnn-cell-type`) and attention (`--attention-type`) types and more.

This will train a "base" [Transformer](https://arxiv.org/abs/1706.03762) model.
There are also several parameters controlling training itself.
Unless you specify a different optimizer (`--optimizer`), [Adam](https://arxiv.org/abs/1412.6980) will be used.
Additionally, you can control the batch size (`--batch-size`), the learning rate schedule (`--learning-rate-schedule`) and other parameters relevant for training.
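
For example, a run that sets some of these options explicitly might look like the following sketch (the flag names are the ones used elsewhere in this documentation; the values are illustrative rather than recommended settings):

```bash
# Illustrative only: same data and output directory as above, with the
# optimizer, batch size, learning rate, and checkpoint interval made explicit.
python -m sockeye.train -d train_data \
    -vs newstest2016.tc.BPE.de \
    -vt newstest2016.tc.BPE.en \
    --optimizer adam \
    --batch-size 4096 \
    --initial-learning-rate 0.0002 \
    --checkpoint-interval 4000 \
    --decode-and-evaluate 500 \
    --use-cpu \
    -o wmt_model
```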
176 changes: 176 additions & 0 deletions docs/tutorials/wmt_large.md
@@ -0,0 +1,176 @@
# Large Data: WMT 2018 German-English

This tutorial covers training a Sockeye model using an arbitrarily large amount of data.
We use the data provided for the [WMT 2018](http://www.statmt.org/wmt18/translation-task.html) German-English news task (41 million parallel sentences), though similar settings could be used for even larger data sets.

## Setup

**NOTE**: This tutorial assumes that 4 local GPUs are available.
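
As an optional check (not part of the original recipe), you can confirm how many GPUs are visible before starting:

```bash
# List the GPUs visible to the NVIDIA driver; expect 4 entries for this tutorial.
nvidia-smi -L
```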

For this tutorial, we use the Sockeye Docker image.

1. Follow the linked instructions to install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker).

2. Build the Docker image and record the commit used as the tag:

```bash
python3 sockeye_contrib/docker/build.py

export TAG=$(git rev-parse --short HEAD)
```

3. This tutorial uses two external pieces of software, the [subword-nmt](https://github.com/rsennrich/subword-nmt) tool that implements byte-pair encoding (BPE) and the [langid.py](https://github.com/saffsd/langid.py) tool that performs language identification:

```bash
git clone https://github.com/rsennrich/subword-nmt.git
export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH

git clone https://github.com/saffsd/langid.py.git
export PYTHONPATH=$(pwd)/langid.py:$PYTHONPATH
```

4. We also recommend installing [GNU Parallel](https://www.gnu.org/software/parallel/) to speed up preprocessing steps (run `apt-get install parallel` or `yum install parallel`).

## Data

We use the preprocessed data provided for the WMT 2018 news translation shared task.
Download and extract the data using the following commands:

```bash
wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/corpus.gz
wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/dev.tgz
zcat corpus.gz |cut -f1 >corpus.de
zcat corpus.gz |cut -f2 >corpus.en
tar xvzf dev.tgz '*.en' '*.de'
```

## Preprocessing

The data has already been tokenized and true-cased; however, no significant corpus cleaning has been applied.
The majority of the data comes from inherently noisy web crawls (sentence pairs are not always in the correct language, or even natural language text).
If we were participating in the WMT evaluation, we would spend a substantial amount of effort selecting clean training data from the noisy corpus.
For this tutorial, we run a simple cleaning step that retains sentence pairs for which a language identification model classifies the target side as English.
The use of GNU Parallel is optional, but makes this step much faster:

```bash
parallel --pipe --keep-order \
python -m langid.langid --line -l en,de <corpus.en >corpus.en.langid

paste corpus.en.langid corpus.de |grep "^('en" |cut -f2 >corpus.de.clean
paste corpus.en.langid corpus.en |grep "^('en" |cut -f2 >corpus.en.clean
```
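
As an optional sanity check (not part of the original recipe), verify that the two cleaned files are still parallel, i.e. contain the same number of lines:

```bash
# Both counts should match; if they differ, revisit the cleaning step.
wc -l corpus.de.clean corpus.en.clean
```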

We next use BPE to learn a joint sub-word vocabulary from the clean training data.
To speed up this step, we use random samples of the source and target data (note that these samples will not be parallel, but BPE training does not require parallel data).

```bash
shuf -n 1000000 corpus.de.clean >corpus.de.clean.sample
shuf -n 1000000 corpus.en.clean >corpus.en.clean.sample

python -m subword_nmt.learn_joint_bpe_and_vocab \
--input corpus.de.clean.sample corpus.en.clean.sample \
-s 32000 \
-o bpe.codes \
--write-vocabulary bpe.vocab.de bpe.vocab.en
```

We use this vocabulary to encode our training, validation, and test data.
For simplicity, we use the 2016 data for validation and 2017 data for test.
GNU Parallel can also significantly speed up this step.

```bash
parallel --pipe --keep-order \
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <corpus.de.clean >corpus.de.clean.bpe
parallel --pipe --keep-order \
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <corpus.en.clean >corpus.en.clean.bpe

python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2016.tc.de >newstest2016.tc.de.bpe
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2016.tc.en >newstest2016.tc.en.bpe

python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2017.tc.de >newstest2017.tc.de.bpe
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2017.tc.en >newstest2017.tc.en.bpe
```

## Training

Now that our data is cleaned and sub-word encoded, we are almost ready to start model training.
We first run a data preparation step that splits the training data into shards and serializes it in MXNet's NDArray format.
This allows us to train on data of any size by efficiently loading and unloading different pieces during training:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
python -m sockeye.prepare_data \
-s corpus.de.clean.bpe \
-t corpus.en.clean.bpe \
-o prepared_data \
--shared-vocab \
--word-min-count 2 \
--max-seq-len 99 \
--num-samples-per-shard 10000000 \
--seed 1
```

We then start Sockeye training:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work -e OMP_NUM_THREADS=4 sockeye:$TAG \
python -m sockeye.train \
-d prepared_data \
-vs newstest2016.tc.de.bpe \
-vt newstest2016.tc.en.bpe \
-o model \
--num-layers 6 \
--transformer-model-size 512 \
--transformer-attention-heads 8 \
--transformer-feed-forward-num-hidden 2048 \
--weight-tying \
--weight-tying-type src_trg_softmax \
--optimizer adam \
--batch-size 8192 \
--checkpoint-interval 4000 \
--initial-learning-rate 0.0002 \
--learning-rate-reduce-factor 0.9 \
--learning-rate-reduce-num-not-improved 8 \
--max-num-checkpoint-not-improved 60 \
--decode-and-evaluate 500 \
--device-ids -4 \
--seed 1
```

This trains a "base" [Transformer](https://arxiv.org/abs/1706.03762) model using the [Adam](https://arxiv.org/abs/1412.6980) optimizer with a batch size of 8192 tokens.
The learning rate is automatically reduced when validation perplexity does not improve for 8 checkpoints (4000 batches per checkpoint), and training stops when validation perplexity does not improve for 60 checkpoints.
At each checkpoint, Sockeye runs a separate decoder process to evaluate metrics such as BLEU on a sample of the validation data (500 sentences).
Note that these scores are calculated on the tokens provided to Sockeye; in this tutorial, for example, BLEU is calculated on the sub-words we created above.
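
If you want to follow progress between checkpoints, one option is to tail the per-checkpoint metrics that Sockeye writes into the model directory (the exact file name and layout are an assumption here and may differ between versions; check your output folder):

```bash
# Follow training/validation metrics as each checkpoint is written.
# Path and format are assumptions; adjust to your Sockeye version.
tail -f model/metrics
```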

Training this model takes around 100 hours (25 epochs) on 4 NVIDIA Tesla V100-SXM2-16GB GPUs.
Training perplexity reaches ~4.45 and validation perplexity reaches ~3.05.

## Evaluation

Now the model is ready to translate data.
Input should be preprocessed identically to the training data, including sub-word encoding (BPE).
Run the following to translate the test set that we've already preprocessed:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
python -m sockeye.translate \
-i newstest2017.tc.de.bpe \
-o newstest2017.tc.hyp.bpe \
-m model \
--beam-size 5 \
--batch-size 64 \
--device-ids -1
```

To evaluate the translations, reverse the BPE sub-word encoding and run [sacreBLEU](https://github.com/mjpost/sacreBLEU) to compute the BLEU score:

```bash
sed -re 's/(@@ |@@$)//g' <newstest2017.tc.hyp.bpe >newstest2017.tc.hyp

nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
sacrebleu newstest2017.tc.en -tok none -i newstest2017.tc.hyp
```

The result should be near 36 BLEU.
Note that this is tokenized, normalized, and true-cased data.
If we were actually participating in WMT, the translations would need to be recased and detokenized for human evaluation.
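
As a sketch of what that post-processing might look like (the `sacremoses` detokenizer and the exact CLI options shown here are assumptions, not part of the WMT 2018 recipe):

```bash
pip install sacremoses

# Detokenize the de-BPE'd English hypotheses; recasing would still be needed
# on top of this for a true human-evaluation setup.
sacremoses -l en detokenize <newstest2017.tc.hyp >newstest2017.hyp.detok
```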
