diff --git a/CHANGELOG.md b/CHANGELOG.md
index bb0e23134..50d1293fd 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -17,6 +17,7 @@ Each version section may have have subsections for: _Added_, _Changed_, _Removed
 - Update to [MXNet 1.5.0](https://github.com/apache/incubator-mxnet/tree/1.5.0)
 - Moved `SockeyeModel` implementation and all layers to [Gluon API](http://mxnet.incubator.apache.org/versions/master/gluon/index.html)
 - Removed support for Python 3.4.
+- Removed image captioning module
 - Removed outdated Autopilot module
 - Removed unused training options: Eve, Nadam, RMSProp, Nag, Adagrad, and Adadelta optimizers, `fixed-step` and `fixed-rate-inv-t` learning rate schedulers
 - Updated and renamed learning rate scheduler `fixed-rate-inv-sqrt-t` -> `inv-sqrt-decay`
diff --git a/README.md b/README.md
index 3f42b36a6..010104195 100644
--- a/README.md
+++ b/README.md
@@ -30,7 +30,9 @@ See the [Dockerfile documentation](sockeye_contrib/docker) for more information.
 
 ## Documentation
 
 For information on how to use Sockeye, please visit [our documentation](https://awslabs.github.io/sockeye/).
-Developers may be interested in our [developer guidelines](https://awslabs.github.io/sockeye/development.html).
+
+- For a quickstart guide to training a large data WMT model, see the [WMT 2018 German-English tutorial](https://awslabs.github.io/sockeye/tutorials/wmt_large.html).
+- Developers may be interested in our [developer guidelines](https://awslabs.github.io/sockeye/development.html).
 
 ## Citation
diff --git a/docs/index.md b/docs/index.md
index 43ed555cf..6d48f7b6c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -13,15 +13,11 @@ layout: default
 
 This is the documentation for Sockeye, a sequence-to-sequence framework for Neural Machine Translation based on Apache MXNet Incubating.
 It implements state-of-the-art encoder-decoder architectures, such as
 
-- Deep Recurrent Neural Networks with Attention [[Bahdanau, '14](https://arxiv.org/abs/1409.0473)]
 - Transformer Models with self-attention [[Vaswani et al, '17](https://arxiv.org/abs/1706.03762)]
-- Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]
-
-In addition, this framework provides an experimental [image-to-description module](https://github.com/awslabs/sockeye/tree/master/sockeye/image_captioning) that can be used for [image captioning](image_captioning.html).
 
 Recent developments and changes are tracked in our [CHANGELOG](https://github.com/awslabs/sockeye/blob/master/CHANGELOG.md).
 
-If you are interested in collaborating or have any questions, please submit a pull request or [issue](https://github.com/awslabs/sockeye/issues/new).
+If you are interested in collaborating or have any questions, please submit a pull request or [issue](https://github.com/awslabs/sockeye/issues/new). You can also send questions to *sockeye-dev-at-amazon-dot-com*.
 
 Developers may be interested in [our developer guidelines](development.html).
diff --git a/docs/sockeye_captioning.bib b/docs/sockeye_captioning.bib
deleted file mode 100644
index 4c26cffb1..000000000
--- a/docs/sockeye_captioning.bib
+++ /dev/null
@@ -1,12 +0,0 @@
-@article{SockeyeCaptioning:18,
- author = {Bazzani, Loris and Domhan, Tobias and Hieber, Felix},
- title = "{Image Captioning as Neural Machine Translation Task in SOCKEYE}",
- journal = {arXiv preprint arXiv:1810.04101},
-archivePrefix = "arXiv",
- eprint = {1810.04101},
- primaryClass = "cs.CV",
- keywords = {Computer Science - Computer Vision and Pattern Recognition},
- year = 2018,
- month = oct,
- url = {https://arxiv.org/abs/1810.04101}
-}
diff --git a/docs/tutorials.md b/docs/tutorials.md
index 8c6d7bae5..372513137 100644
--- a/docs/tutorials.md
+++ b/docs/tutorials.md
@@ -13,3 +13,4 @@ introduce different concepts and parameters used for training and translation.
 1. [Sequence copy task](tutorials/seqcopy.html)
 1. [WMT German to English news translation](tutorials/wmt.html)
 1. [Domain adaptation of NMT models](tutorials/adapt.html)
+1. [Large data: WMT German-English 2018](tutorials/wmt_large.html)
diff --git a/docs/tutorials/wmt.md b/docs/tutorials/wmt.md
index 19ec7c505..3e608c905 100644
--- a/docs/tutorials/wmt.md
+++ b/docs/tutorials/wmt.md
@@ -16,7 +16,7 @@ git clone https://github.com/rsennrich/subword-nmt.git
 export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH
 ```
 
-We will visualize training progress using Tensorboard and its MXNet adaptor, `mxboard`.
+We will visualize training progress using TensorBoard and its MXNet adaptor, `mxboard`. Install it using:
 
 ```bash
 pip install tensorboard mxboard
@@ -95,24 +95,13 @@ We can now kick off the training process:
 python -m sockeye.train -d train_data \
                         -vs newstest2016.tc.BPE.de \
                         -vt newstest2016.tc.BPE.en \
-                        --encoder rnn \
-                        --decoder rnn \
-                        --num-embed 256 \
-                        --rnn-num-hidden 512 \
-                        --rnn-attention-type dot \
                         --max-seq-len 60 \
                         --decode-and-evaluate 500 \
                         --use-cpu \
                         -o wmt_model
 ```
 
-This will train a 1-layer bi-LSTM encoder, 1-layer LSTM decoder with dot attention.
-Sockeye offers a whole variety of different options regarding the model architecture,
-such as stacked RNNs with residual connections (`--num-layers`, `--rnn-residual-connections`),
-[Transformer](https://arxiv.org/abs/1706.03762) encoder and decoder (`--encoder transformer`, `--decoder transformer`),
-[ConvS2S](https://arxiv.org/pdf/1705.03122) (`--encoder cnn`, `--decoder cnn`),
-various RNN (`--rnn-cell-type`) and attention (`--attention-type`) types and more.
-
+This will train a "base" [Transformer](https://arxiv.org/abs/1706.03762) model.
 There are also several parameters controlling training itself.
 Unless you specify a different optimizer (`--optimizer`) [Adam](https://arxiv.org/abs/1412.6980) will be used.
 Additionally, you can control the batch size (`--batch-size`), the learning rate schedule (`--learning-rate-schedule`) and other parameters relevant for training.
diff --git a/docs/tutorials/wmt_large.md b/docs/tutorials/wmt_large.md
new file mode 100644
index 000000000..2be6af35e
--- /dev/null
+++ b/docs/tutorials/wmt_large.md
@@ -0,0 +1,176 @@
+# Large Data: WMT 2018 German-English
+
+This tutorial covers training a Sockeye model using an arbitrarily large amount of data.
+We use the data provided for the [WMT 2018](http://www.statmt.org/wmt18/translation-task.html) German-English news task (41 million parallel sentences), though similar settings could be used for even larger data sets.
+
+## Setup
+
+**NOTE**: This tutorial assumes that 4 local GPUs are available.
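+
+To sanity-check the GPUs visible on your machine, you can run `nvidia-smi`; if you have a different number of GPUs, adjust the `--device-ids` arguments in the commands below accordingly:
+
+```bash
+# Lists one line per visible GPU (the settings in this tutorial expect 4).
+nvidia-smi -L
+```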
+
+For this tutorial, we use the Sockeye Docker image.
+
+1. Follow the linked instructions to install [nvidia-docker](https://github.com/NVIDIA/nvidia-docker).
+
+2. Build the Docker image and record the commit used as the tag:
+
+```bash
+python3 sockeye_contrib/docker/build.py
+
+export TAG=$(git rev-parse --short HEAD)
+```
+
+3. This tutorial uses two external pieces of software, the [subword-nmt](https://github.com/rsennrich/subword-nmt) tool that implements byte-pair encoding (BPE) and the [langid.py](https://github.com/saffsd/langid.py) tool that performs language identification:
+
+```bash
+git clone https://github.com/rsennrich/subword-nmt.git
+export PYTHONPATH=$(pwd)/subword-nmt:$PYTHONPATH
+
+git clone https://github.com/saffsd/langid.py.git
+export PYTHONPATH=$(pwd)/langid.py:$PYTHONPATH
+```
+
+4. We also recommend installing [GNU Parallel](https://www.gnu.org/software/parallel/) to speed up preprocessing steps (run `apt-get install parallel` or `yum install parallel`).
+
+## Data
+
+We use the preprocessed data provided for the WMT 2018 news translation shared task.
+Download and extract the data using the following commands:
+
+```bash
+wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/corpus.gz
+wget http://data.statmt.org/wmt18/translation-task/preprocessed/de-en/dev.tgz
+zcat corpus.gz |cut -f1 >corpus.de
+zcat corpus.gz |cut -f2 >corpus.en
+tar xvzf dev.tgz '*.en' '*.de'
+```
+
+## Preprocessing
+
+The data has already been tokenized and true-cased; however, no significant corpus cleaning has been applied.
+The majority of the data is taken from inherently noisy web crawls (sentence pairs are not always in the correct language, or even natural language text).
+If we were participating in the WMT evaluation, we would spend a substantial amount of effort selecting clean training data from the noisy corpus.
+For this tutorial, we run a simple cleaning step that retains sentence pairs for which a language identification model classifies the target side as English.
+The use of GNU Parallel is optional, but makes this step much faster:
+
+```bash
+parallel --pipe --keep-order \
+    python -m langid.langid --line -l en,de <corpus.en >corpus.en.langid
+
+paste corpus.en.langid corpus.de |grep "^('en" |cut -f2 >corpus.de.clean
+paste corpus.en.langid corpus.en |grep "^('en" |cut -f2 >corpus.en.clean
+```
+
+We next use BPE to learn a joint sub-word vocabulary from the clean training data.
+To speed up this step, we use random samples of the source and target data (note that these samples will not be parallel, but BPE training does not require parallel data).
+
+```bash
+shuf -n 1000000 corpus.de.clean >corpus.de.clean.sample
+shuf -n 1000000 corpus.en.clean >corpus.en.clean.sample
+
+python -m subword_nmt.learn_joint_bpe_and_vocab \
+    --input corpus.de.clean.sample corpus.en.clean.sample \
+    -s 32000 \
+    -o bpe.codes \
+    --write-vocabulary bpe.vocab.de bpe.vocab.en
+```
+
+We use this vocabulary to encode our training, validation, and test data.
+For simplicity, we use the 2016 data for validation and the 2017 data for test.
+GNU Parallel can also significantly speed up this step.
+
+```bash
+parallel --pipe --keep-order \
+    python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <corpus.de.clean >corpus.de.clean.bpe
+parallel --pipe --keep-order \
+    python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <corpus.en.clean >corpus.en.clean.bpe
+
+python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2016.tc.de >newstest2016.tc.de.bpe
+python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2016.tc.en >newstest2016.tc.en.bpe
+
+python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <newstest2017.tc.de >newstest2017.tc.de.bpe
+python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.en --vocabulary-threshold 50 <newstest2017.tc.en >newstest2017.tc.en.bpe
+```
+
+## Training
+
+Now that our data is cleaned and sub-word encoded, we are almost ready to start model training.
+We first run a data preparation step that splits the training data into shards and serializes it in MXNet's NDArray format.
+This allows us to train on data of any size by efficiently loading and unloading different pieces during training:
+
+```bash
+nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
+    python -m sockeye.prepare_data \
+    -s corpus.de.clean.bpe \
+    -t corpus.en.clean.bpe \
+    -o prepared_data \
+    --shared-vocab \
+    --word-min-count 2 \
+    --max-seq-len 99 \
+    --num-samples-per-shard 10000000 \
+    --seed 1
+```
+
+We then start Sockeye training:
+
+```bash
+nvidia-docker run --rm -i -v $(pwd):/work -w /work -e OMP_NUM_THREADS=4 sockeye:$TAG \
+    python -m sockeye.train \
+    -d prepared_data \
+    -vs newstest2016.tc.de.bpe \
+    -vt newstest2016.tc.en.bpe \
+    -o model \
+    --num-layers 6 \
+    --transformer-model-size 512 \
+    --transformer-attention-heads 8 \
+    --transformer-feed-forward-num-hidden 2048 \
+    --weight-tying \
+    --weight-tying-type src_trg_softmax \
+    --optimizer adam \
+    --batch-size 8192 \
+    --checkpoint-interval 4000 \
+    --initial-learning-rate 0.0002 \
+    --learning-rate-reduce-factor 0.9 \
+    --learning-rate-reduce-num-not-improved 8 \
+    --max-num-checkpoint-not-improved 60 \
+    --decode-and-evaluate 500 \
+    --device-ids -4 \
+    --seed 1
+```
+
+This trains a "base" [Transformer](https://arxiv.org/abs/1706.03762) model using the [Adam](https://arxiv.org/abs/1412.6980) optimizer with a batch size of 8192 tokens.
+The learning rate will be reduced automatically when validation perplexity does not improve for 8 checkpoints (4000 batches per checkpoint), and training will conclude when validation perplexity does not improve for 60 checkpoints.
+At each checkpoint, Sockeye runs a separate decoder process to evaluate metrics such as BLEU on a sample of the validation data (500 sentences).
+Note that these scores are calculated on the tokens provided to Sockeye; in this tutorial, BLEU is calculated on the sub-words we created above.
+
+Training this model takes around 100 hours (25 epochs) on 4 NVIDIA Tesla V100-SXM2-16GB GPUs.
+Training perplexity reaches ~4.45 and validation perplexity reaches ~3.05.
+
+## Evaluation
+
+Now the model is ready to translate data.
+Input should be preprocessed identically to the training data, including sub-word encoding (BPE).
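+For new input that is not part of the WMT data, this means tokenizing and true-casing it in the same way as the training corpus and then applying the BPE codes learned above. As a minimal sketch, an already tokenized and true-cased German file (here called `input.tc.de`, a hypothetical name) could be encoded like this:
+
+```bash
+# Apply the learned BPE codes; input.tc.de is a placeholder for your own tokenized, true-cased text.
+python -m subword_nmt.apply_bpe -c bpe.codes --vocabulary bpe.vocab.de --vocabulary-threshold 50 <input.tc.de >input.tc.de.bpe
+```
+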
+Run the following to translate the test set that we've already preprocessed:
+
+```bash
+nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
+    python -m sockeye.translate \
+    -i newstest2017.tc.de.bpe \
+    -o newstest2017.tc.hyp.bpe \
+    -m model \
+    --beam-size 5 \
+    --batch-size 64 \
+    --device-ids -1
+```
+
+To evaluate the translations, reverse the BPE sub-word encoding and run [sacreBLEU](https://github.com/mjpost/sacreBLEU) to compute the BLEU score:
+
+```bash
+sed -re 's/(@@ |@@$)//g' <newstest2017.tc.hyp.bpe >newstest2017.tc.hyp
+
+nvidia-docker run --rm -i -v $(pwd):/work -w /work sockeye:$TAG \
+    sacrebleu newstest2017.tc.en -tok none -i newstest2017.tc.hyp
+```
+
+The result should be around 36 BLEU.
+Note that this is tokenized, normalized, and true-cased data.
+If we were actually participating in WMT, the translations would need to be recased and detokenized for human evaluation.
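+
+As a sketch of that post-processing (not required for the scores above), the standard [Moses](https://github.com/moses-smt/mosesdecoder) scripts could be used; the output name `newstest2017.hyp.detok` is just an example:
+
+```bash
+git clone https://github.com/moses-smt/mosesdecoder.git
+
+# Undo true-casing, then detokenize the English output produced above.
+perl mosesdecoder/scripts/recaser/detruecase.perl <newstest2017.tc.hyp \
+    | perl mosesdecoder/scripts/tokenizer/detokenizer.perl -l en >newstest2017.hyp.detok
+```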