Commit acb0815: Sockeye 2 Documentation Update (awslabs#722)

* Documentation update
* Update large data tutorial
* WMT large update

1 parent: 26cbc97
Showing 7 changed files with 184 additions and 31 deletions.
# Large Data: WMT 2018 German-English

This tutorial covers training a Sockeye model using an arbitrarily large amount of data.
We use the data provided for the [WMT 2018](http://www.statmt.org/wmt18/translation-task.html) German-English news task (41 million parallel sentences), though similar settings could be used for even larger data sets.

## Setup

**NOTE**: This build assumes that 4 local GPUs are available.

For this tutorial, we use the Sockeye Docker image.

1. Follow the linked instructions to install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker).

2. Build the Docker image and record the commit used as the tag:

   ```bash
   python3 sockeye_contrib/docker/build.py

   export TAG=$(git rev-parse --short HEAD)
   ```

3. This tutorial uses two external pieces of software: the [subword-nmt](https://github.com/rsennrich/subword-nmt) tool, which implements byte-pair encoding (BPE), and the [langid.py](https://github.com/saffsd/langid.py) tool, which performs language identification. Clone both and add them to `PYTHONPATH`; a quick check of this setup is shown after the list.

   ```bash
   git clone https://github.com/rsennrich/subword-nmt.git
   export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH

   git clone https://github.com/saffsd/langid.py.git
   export PYTHONPATH=$(pwd)/langid.py:$PYTHONPATH
   ```

4. We also recommend installing [GNU Parallel](https://www.gnu.org/software/parallel/) to speed up preprocessing steps (run `apt-get install parallel` or `yum install parallel`).
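
Optionally, you can verify the setup before continuing with a minimal sanity check; it assumes `$TAG` and `PYTHONPATH` are still set as in steps 2 and 3:

```bash
# List the Sockeye image built in step 2 (one entry for sockeye:$TAG should appear).
docker images sockeye:$TAG

# Confirm that the subword-nmt and langid.py packages are importable.
python -c "import subword_nmt, langid; print('tools OK')"
```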

## Data

We use the preprocessed data provided for the WMT 2018 news translation shared task.
Download and extract the data using the following commands:

```bash
wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/corpus.gz
wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/dev.tgz
zcat corpus.gz |cut -f1 >corpus.de
zcat corpus.gz |cut -f2 >corpus.en
tar xvzf dev.tgz '*.en' '*.de'
```
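
As an optional sanity check, confirm that the two sides of the extracted corpus are parallel; both files should contain the same number of lines (roughly 41 million):

```bash
# Both sides of the parallel corpus must have identical line counts.
wc -l corpus.de corpus.en
```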

## Preprocessing

The data has already been tokenized and true-cased; however, no significant corpus cleaning has been applied.
The majority of the data comes from inherently noisy web crawls (sentence pairs are not always in the correct language, or even natural language text).
If we were participating in the WMT evaluation, we would spend a substantial amount of effort selecting clean training data from the noisy corpus.
For this tutorial, we run a simple cleaning step that retains sentence pairs for which a language identification model classifies the target side as English.
The use of GNU Parallel is optional, but makes this step much faster:

```bash
parallel --pipe --keep-order \
    python -m langid.langid --line -l en,de <corpus.en >corpus.en.langid

paste corpus.en.langid corpus.de |grep "^('en" |cut -f2 >corpus.de.clean
paste corpus.en.langid corpus.en |grep "^('en" |cut -f2 >corpus.en.clean
```
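
As an optional check, count how many sentence pairs survive the filter; the two cleaned files must remain parallel:

```bash
# The cleaned source and target files must have identical line counts.
wc -l corpus.de.clean corpus.en.clean
```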

We next use BPE to learn a joint sub-word vocabulary from the clean training data.
To speed up this step, we use random samples of the source and target data (note that these samples will not be parallel, but BPE training does not require parallel data).

```bash
shuf -n 1000000 corpus.de.clean >corpus.de.clean.sample
shuf -n 1000000 corpus.en.clean >corpus.en.clean.sample

python -m subword_nmt.learn_joint_bpe_and_vocab \
    --input corpus.de.clean.sample corpus.en.clean.sample \
    -s 32000 \
    -o bpe.codes \
    --write-vocabulary bpe.vocab.de bpe.vocab.en
```
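
If you are curious, you can inspect the learned BPE model; `bpe.codes` lists the merge operations and the vocabulary files list sub-word tokens with their frequencies in the samples:

```bash
# Peek at the first BPE merge operations and count the vocabulary entries.
head -n 5 bpe.codes
wc -l bpe.vocab.de bpe.vocab.en
```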

We use this vocabulary to encode our training, validation, and test data.
For simplicity, we use the 2016 data for validation and the 2017 data for test.
GNU Parallel can also significantly speed up this step.

```bash
parallel --pipe --keep-order \
    python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <corpus.de.clean >corpus.de.clean.bpe
parallel --pipe --keep-order \
    python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <corpus.en.clean >corpus.en.clean.bpe

python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2016.tc.de >newstest2016.tc.de.bpe
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2016.tc.en >newstest2016.tc.en.bpe

python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2017.tc.de >newstest2017.tc.de.bpe
python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2017.tc.en >newstest2017.tc.en.bpe
```
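
As a quick illustration, look at a couple of encoded lines; rare words should now be split into sub-word units joined by the `@@ ` separator:

```bash
# Sub-word pieces end in "@@" except for the final piece of each word.
head -n 2 corpus.en.clean.bpe
```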

## Training

Now that our data is cleaned and sub-word encoded, we are almost ready to start model training.
We first run a data preparation step that splits the training data into shards and serializes it in MXNet's NDArray format.
This allows us to train on data of any size by efficiently loading and unloading different pieces during training:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
    python -m sockeye.prepare_data \
        -s corpus.de.clean.bpe \
        -t corpus.en.clean.bpe \
        -o prepared_data \
        --shared-vocab \
        --word-min-count 2 \
        --max-seq-len 99 \
        --num-samples-per-shard 10000000 \
        --seed 1
```
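
To confirm that data preparation succeeded, you can list the output directory. The exact file names depend on the Sockeye version, but you should see several data shards along with vocabulary and configuration files:

```bash
# The prepared training data is stored as shards plus vocabularies and a config.
ls -lh prepared_data
```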

We then start Sockeye training:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work -e OMP_NUM_THREADS=4 sockeye:$TAG \
    python -m sockeye.train \
        -d prepared_data \
        -vs newstest2016.tc.de.bpe \
        -vt newstest2016.tc.en.bpe \
        -o model \
        --num-layers 6 \
        --transformer-model-size 512 \
        --transformer-attention-heads 8 \
        --transformer-feed-forward-num-hidden 2048 \
        --weight-tying \
        --weight-tying-type src_trg_softmax \
        --optimizer adam \
        --batch-size 8192 \
        --checkpoint-interval 4000 \
        --initial-learning-rate 0.0002 \
        --learning-rate-reduce-factor 0.9 \
        --learning-rate-reduce-num-not-improved 8 \
        --max-num-checkpoint-not-improved 60 \
        --decode-and-evaluate 500 \
        --device-ids -4 \
        --seed 1
```
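
Training runs for several days, so it is worth monitoring progress. One simple option, assuming the default output layout, is to follow the metrics file that Sockeye writes to the model directory; it records per-checkpoint values such as perplexity and the BLEU score from `--decode-and-evaluate`:

```bash
# Follow per-checkpoint training and validation metrics as they are written.
tail -f model/metrics
```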

This trains a "base" [Transformer](https://arxiv.org/abs/1706.03762) model using the [Adam](https://arxiv.org/abs/1412.6980) optimizer with a batch size of 8192 tokens.
The learning rate is automatically reduced by a factor of 0.9 when validation perplexity does not improve for 8 checkpoints (4000 batches per checkpoint), and training stops when validation perplexity does not improve for 60 checkpoints.
At each checkpoint, Sockeye runs a separate decoder process to evaluate metrics such as BLEU on a sample of the validation data (500 sentences).
Note that these scores are calculated on the tokens provided to Sockeye; in this tutorial, BLEU is therefore calculated on the sub-words we created above.

Training this model takes around 100 hours (25 epochs) on 4 NVIDIA Tesla V100-SXM2-16GB GPUs.
Training perplexity reaches ~4.45 and validation perplexity reaches ~3.05.

## Evaluation

Now the model is ready to translate data.
Input should be preprocessed identically to the training data, including sub-word encoding (BPE).
Run the following to translate the test set that we've already preprocessed:

```bash
nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
    python -m sockeye.translate \
        -i newstest2017.tc.de.bpe \
        -o newstest2017.tc.hyp.bpe \
        -m model \
        --beam-size 5 \
        --batch-size 64 \
        --device-ids -1
```
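
The same model can translate new input as long as it is preprocessed exactly like the training data. A minimal sketch, assuming the input text is already tokenized and true-cased, is to pipe it through `apply_bpe` and then into `sockeye.translate` (which reads standard input when `-i` is not given), and finally to strip the BPE separators from the output:

```bash
# The example sentence is assumed to be tokenized and true-cased already.
echo "das ist ein Test ." \
    | python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 \
    | nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
        python -m sockeye.translate -m model --beam-size 5 --device-ids -1 \
    | sed -re 's/(@@ |@@$)//g'
```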

To evaluate the translations, reverse the BPE sub-word encoding and run [sacreBLEU](https://github.com/mjpost/sacreBLEU) to compute the BLEU score:

```bash
sed -re 's/(@@ |@@$)//g' <newstest2017.tc.hyp.bpe >newstest2017.tc.hyp

nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
    sacrebleu newstest2017.tc.en -tok none -i newstest2017.tc.hyp
```

The result should be near 36 BLEU.
Note that this score is computed on tokenized, normalized, and true-cased data.
If we were actually participating in WMT, the translations would need to be recased and detokenized for human evaluation.
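
For reference, one possible way to produce recased, detokenized output is with the Moses `detruecase.perl` and `detokenizer.perl` scripts; this is a sketch that assumes a `mosesdecoder` checkout rather than a required step of this tutorial:

```bash
# Recase and detokenize the hypotheses for human-readable output.
git clone https://github.com/moses-smt/mosesdecoder.git
perl mosesdecoder/scripts/recaser/detruecase.perl <newstest2017.tc.hyp >newstest2017.hyp.recased
perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l en <newstest2017.hyp.recased >newstest2017.hyp.detok
```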