Merge internal changes (facebookresearch#283)
Summary:
Pull Request resolved: pytorch/translate#283

Pull Request resolved: facebookresearch#428

Differential Revision: D13564190

Pulled By: myleott

fbshipit-source-id: 3b62282d7069c288f5bdd1dd2c120788cee4abb5
Myle Ott authored and facebook-github-bot committed Jan 5, 2019
1 parent 0cb8713 commit 7633129
Showing 59 changed files with 839 additions and 557 deletions.
9 changes: 6 additions & 3 deletions README.md
@@ -19,10 +19,13 @@ of various sequence-to-sequence models, including:

Fairseq features:
- multi-GPU (distributed) training on one machine or across multiple machines
- fast beam search generation on both CPU and GPU
- fast generation on both CPU and GPU with multiple search algorithms implemented:
- beam search
- Diverse Beam Search ([Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424))
- sampling (unconstrained and top-k)
- large mini-batch training even on a single GPU via delayed updates
- fast half-precision floating point (FP16) training
- extensible: easily register new models, criterions, and tasks
- extensible: easily register new models, criterions, tasks, optimizers and learning rate schedulers

We also provide [pre-trained models](#pre-trained-models) for several benchmark
translation and language modeling datasets.
@@ -34,7 +37,7 @@ translation and language modeling datasets.
* For training new models, you'll also need an NVIDIA GPU and [NCCL](https://github.com/NVIDIA/nccl)
* Python version 3.6

Currently fairseq requires PyTorch version >= 0.4.0.
Currently fairseq requires PyTorch version >= 1.0.0.
Please follow the instructions here: https://github.com/pytorch/pytorch#installation.

If you use Docker make sure to increase the shared memory size either with
45 changes: 0 additions & 45 deletions distributed_train.py

This file was deleted.

18 changes: 18 additions & 0 deletions docs/criterions.rst
@@ -6,8 +6,26 @@
Criterions
==========

Criterions compute the loss function given the model and batch, roughly::

    loss = criterion(model, batch)

.. automodule:: fairseq.criterions
:members:

.. autoclass:: fairseq.criterions.FairseqCriterion
:members:
:undoc-members:

.. autoclass:: fairseq.criterions.adaptive_loss.AdaptiveLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.composite_loss.CompositeLoss
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.cross_entropy.CrossEntropyCriterion
:members:
:undoc-members:
.. autoclass:: fairseq.criterions.label_smoothed_cross_entropy.LabelSmoothedCrossEntropyCriterion
:members:
:undoc-members:
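
The ``loss = criterion(model, batch)`` contract above is what makes criterions pluggable. As a rough, hypothetical sketch (not from this commit), a new criterion could be registered as follows; the ``my_cross_entropy`` name is made up, and the ``FairseqCriterion`` interface assumed here (``forward(model, sample)`` returning a ``(loss, sample_size, logging_output)`` tuple, plus a ``padding_idx`` attribute derived from the task's dictionary) may differ across fairseq versions::

    import torch.nn.functional as F

    from fairseq.criterions import FairseqCriterion, register_criterion


    @register_criterion('my_cross_entropy')  # hypothetical plug-in name
    class MyCrossEntropyCriterion(FairseqCriterion):

        def forward(self, model, sample, reduce=True):
            # Run the model on the batch and get token-level log-probabilities.
            net_output = model(**sample['net_input'])
            lprobs = model.get_normalized_probs(net_output, log_probs=True)
            lprobs = lprobs.view(-1, lprobs.size(-1))
            target = model.get_targets(sample, net_output).view(-1)

            # Plain cross entropy over non-padding tokens.
            loss = F.nll_loss(
                lprobs, target,
                ignore_index=self.padding_idx,
                reduction='sum' if reduce else 'none',
            )
            sample_size = sample['ntokens']
            logging_output = {
                'loss': loss.item() if reduce else loss.data,
                'ntokens': sample['ntokens'],
                'sample_size': sample_size,
            }
            return loss, sample_size, logging_output

If registered this way, such a criterion should be selectable via ``--criterion my_cross_entropy``, like the built-in ones documented above.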
16 changes: 16 additions & 0 deletions docs/data.rst
@@ -21,6 +21,20 @@ mini-batches.
.. autoclass:: fairseq.data.MonolingualDataset
:members:

**Helper Datasets**

These datasets wrap other :class:`fairseq.data.FairseqDataset` instances and
provide additional functionality:

.. autoclass:: fairseq.data.BacktranslationDataset
:members:
.. autoclass:: fairseq.data.ConcatDataset
:members:
.. autoclass:: fairseq.data.RoundRobinZipDatasets
:members:
.. autoclass:: fairseq.data.TransformEosDataset
:members:


Dictionary
----------
@@ -32,6 +46,8 @@ Dictionary
Iterators
---------

.. autoclass:: fairseq.data.BufferedIterator
:members:
.. autoclass:: fairseq.data.CountingIterator
:members:
.. autoclass:: fairseq.data.EpochBatchIterator
17 changes: 8 additions & 9 deletions docs/getting_started.rst
@@ -27,21 +27,20 @@ interactively. Here, we use a beam size of 5:
> MODEL_DIR=wmt14.en-fr.fconv-py
> python interactive.py \
--path $MODEL_DIR/model.pt $MODEL_DIR \
--beam 5
--beam 5 --source-lang en --target-lang fr
| loading model(s) from wmt14.en-fr.fconv-py/model.pt
| [en] dictionary: 44206 types
| [fr] dictionary: 44463 types
| Type the input sentence and press return:
> Why is it rare to discover new marine mam@@ mal species ?
O Why is it rare to discover new marine mam@@ mal species ?
H -0.06429661810398102 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins ?
A 0 1 3 3 5 6 6 8 8 8 7 11 12
This generation script produces four types of outputs: a line prefixed
with *S* shows the supplied source sentence after applying the
vocabulary; *O* is a copy of the original source sentence; *H* is the
hypothesis along with an average log-likelihood; and *A* is the
attention maxima for each word in the hypothesis, including the
H -0.1525060087442398 Pourquoi est @-@ il rare de découvrir de nouvelles espèces de mammifères marins ?
P -0.2221 -0.3122 -0.1289 -0.2673 -0.1711 -0.1930 -0.1101 -0.1660 -0.1003 -0.0740 -0.1101 -0.0814 -0.1238 -0.0985 -0.1288
This generation script produces three types of outputs: a line prefixed
with *O* is a copy of the original source sentence; *H* is the
hypothesis along with an average log-likelihood; and *P* is the
positional score per token position, including the
end-of-sentence marker which is omitted from the text.
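
For a concrete reading of that format, the following small, self-contained sketch (illustration only, not part of fairseq) groups the whitespace-separated ``O``/``H``/``P`` lines shown above::

    def parse_generation_output(lines):
        """Group O/H/P lines into (source, hypothesis, avg_score, positional_scores)."""
        results = []
        src, hyp, avg_score = None, None, None
        for line in lines:
            if line.startswith(('O ', 'O\t')):
                # Copy of the original source sentence (everything after the tag).
                src = line.split(None, 1)[1]
            elif line.startswith(('H ', 'H\t')):
                # Hypothesis: tag, average log-likelihood, then the text.
                _, score, hyp = line.split(None, 2)
                avg_score = float(score)
            elif line.startswith(('P ', 'P\t')):
                # Positional scores: one log-probability per output token.
                pos_scores = [float(x) for x in line.split()[1:]]
                results.append((src, hyp, avg_score, pos_scores))
        return results

As a sanity check on the example above, the ``P`` line has 15 scores (14 visible tokens plus the end-of-sentence marker), and their mean is about -0.1525, matching the average log-likelihood on the ``H`` line.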

See the `README <https://github.com/pytorch/fairseq#pre-trained-models>`__ for a
24 changes: 23 additions & 1 deletion docs/lr_scheduler.rst
@@ -6,7 +6,29 @@
Learning Rate Schedulers
========================

TODO
Learning Rate Schedulers update the learning rate over the course of training.
Learning rates can be updated after each update via :func:`step_update` or at
epoch boundaries via :func:`step`.

.. automodule:: fairseq.optim.lr_scheduler
:members:

.. autoclass:: fairseq.optim.lr_scheduler.FairseqLRScheduler
:members:
:undoc-members:

.. autoclass:: fairseq.optim.lr_scheduler.cosine_lr_scheduler.CosineSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.fixed_schedule.FixedSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.inverse_square_root_schedule.InverseSquareRootSchedule
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.reduce_lr_on_plateau.ReduceLROnPlateau
:members:
:undoc-members:
.. autoclass:: fairseq.optim.lr_scheduler.reduce_angular_lr_scheduler.TriangularSchedule
:members:
:undoc-members:
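
To make the ``step``/``step_update`` split concrete, here is a rough, hypothetical scheduler sketch (not from this commit). The ``linear_warmup`` name and the ``--warmup-updates`` flag are made up, and the assumed ``FairseqLRScheduler`` interface (a wrapped optimizer exposing ``set_lr()``/``get_lr()``, ``step(epoch, val_loss)`` at epoch boundaries, ``step_update(num_updates)`` after every update) may differ across versions::

    from fairseq.optim.lr_scheduler import FairseqLRScheduler, register_lr_scheduler


    @register_lr_scheduler('linear_warmup')  # hypothetical plug-in name
    class LinearWarmupSchedule(FairseqLRScheduler):

        def __init__(self, args, optimizer):
            super().__init__(args, optimizer)
            self.warmup_updates = args.warmup_updates
            self.peak_lr = args.lr[0]  # --lr is parsed as a list in fairseq

        @staticmethod
        def add_args(parser):
            # Hypothetical extra command-line option for this scheduler.
            parser.add_argument('--warmup-updates', type=int, default=4000,
                                help='linearly warm up the LR over this many updates')

        def step_update(self, num_updates):
            # Called after every optimizer update: linear warmup, then constant.
            lr = self.peak_lr * min(1.0, num_updates / max(1, self.warmup_updates))
            self.optimizer.set_lr(lr)
            return lr

        def step(self, epoch, val_loss=None):
            # Called at epoch boundaries; nothing epoch-dependent in this sketch.
            super().step(epoch, val_loss)
            return self.optimizer.get_lr()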
4 changes: 2 additions & 2 deletions docs/modules.rst
@@ -1,8 +1,8 @@
Modules
=======

Fairseq provides several stand-alone :class:`torch.nn.Module` s that may be
helpful when implementing a new :class:`FairseqModel`.
Fairseq provides several stand-alone :class:`torch.nn.Module` classes that may
be helpful when implementing a new :class:`~fairseq.models.FairseqModel`.

.. automodule:: fairseq.modules
:members:
22 changes: 22 additions & 0 deletions docs/optim.rst
@@ -6,5 +6,27 @@
Optimizers
==========

Optimizers update the Model parameters based on the gradients.

.. automodule:: fairseq.optim
:members:

.. autoclass:: fairseq.optim.FairseqOptimizer
:members:
:undoc-members:

.. autoclass:: fairseq.optim.adagrad.Adagrad
:members:
:undoc-members:
.. autoclass:: fairseq.optim.adam.FairseqAdam
:members:
:undoc-members:
.. autoclass:: fairseq.optim.fp16_optimizer.FP16Optimizer
:members:
:undoc-members:
.. autoclass:: fairseq.optim.nag.FairseqNAG
:members:
:undoc-members:
.. autoclass:: fairseq.optim.sgd.SGD
:members:
:undoc-members:
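
As a rough illustration of how another ``torch.optim`` optimizer could be exposed (not from this commit), the sketch below mirrors the wrapper pattern of the built-in classes listed above: the subclass builds ``self._optimizer`` and provides an ``optimizer_config``. The ``rmsprop`` name and the ``--rmsprop-alpha`` flag are made up, and the ``FairseqOptimizer`` constructor signature is assumed for this version::

    import torch.optim

    from fairseq.optim import FairseqOptimizer, register_optimizer


    @register_optimizer('rmsprop')  # hypothetical plug-in name
    class FairseqRMSprop(FairseqOptimizer):

        def __init__(self, args, params):
            super().__init__(args, params)
            self._optimizer = torch.optim.RMSprop(params, **self.optimizer_config)

        @staticmethod
        def add_args(parser):
            # Hypothetical extra flag; --lr and --weight-decay are assumed to be
            # provided by the generic training options.
            parser.add_argument('--rmsprop-alpha', type=float, default=0.99,
                                help='smoothing constant for RMSprop')

        @property
        def optimizer_config(self):
            # Keyword arguments forwarded to torch.optim.RMSprop.
            return {
                'lr': self.args.lr[0],
                'alpha': self.args.rmsprop_alpha,
                'weight_decay': self.args.weight_decay,
            }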
10 changes: 8 additions & 2 deletions docs/overview.rst
@@ -22,12 +22,18 @@ fairseq implements the following high-level training flow::
    for epoch in range(num_epochs):
        itr = task.get_batch_iterator(task.dataset('train'))
        for num_updates, batch in enumerate(itr):
            loss = criterion(model, batch)
            optimizer.backward(loss)
            task.train_step(batch, model, criterion, optimizer)
            average_and_clip_gradients()
            optimizer.step()
            lr_scheduler.step_update(num_updates)
        lr_scheduler.step(epoch)

where the default implementation for ``task.train_step`` is roughly::

    def train_step(self, batch, model, criterion, optimizer):
        loss = criterion(model, batch)
        optimizer.backward(loss)

**Registering new plug-ins**

New plug-ins are *registered* through a set of ``@register`` function
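As one concrete shape of that ``@register`` pattern (illustration only, not from this commit), a named architecture can be attached to an existing model. The ``fconv_tiny`` name and its hyperparameters are made up, and importing ``base_architecture`` from ``fairseq.models.fconv`` is an assumption about this version's module layout::

    from fairseq.models import register_model_architecture
    from fairseq.models.fconv import base_architecture


    @register_model_architecture('fconv', 'fconv_tiny')  # hypothetical name
    def fconv_tiny(args):
        # Override a few defaults, then fall back to the stock fconv defaults
        # for anything not set on the command line.
        args.encoder_embed_dim = getattr(args, 'encoder_embed_dim', 128)
        args.decoder_embed_dim = getattr(args, 'decoder_embed_dim', 128)
        args.encoder_layers = getattr(args, 'encoder_layers', '[(128, 3)] * 3')
        args.decoder_layers = getattr(args, 'decoder_layers', '[(128, 3)] * 3')
        base_architecture(args)

If the registration succeeds, the architecture should then be selectable with ``--arch fconv_tiny`` like any built-in one.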
5 changes: 2 additions & 3 deletions docs/tutorial_classifying_names.rst
@@ -353,17 +353,16 @@ The model files should appear in the :file:`checkpoints/` directory.
-------------------------------

Finally we can write a short script to evaluate our model on new inputs. Create
a new file named :file:`eval_classify.py` with the following contents::
a new file named :file:`eval_classifier.py` with the following contents::

from fairseq import data, options, tasks, utils
from fairseq.tokenizer import Tokenizer

# Parse command-line arguments for generation
parser = options.get_generation_parser()
parser = options.get_generation_parser(default_task='simple_classification')
args = options.parse_args_and_arch(parser)

# Setup task
args.task = 'simple_classification'
task = tasks.setup_task(args)

# Load model
7 changes: 5 additions & 2 deletions eval_lm.py
@@ -55,7 +55,9 @@ def main(parsed_args):

# Load ensemble
print('| loading model(s) from {}'.format(parsed_args.path))
models, args = utils.load_ensemble_for_inference(parsed_args.path.split(':'), task, model_arg_overrides=eval(parsed_args.model_overrides))
models, args = utils.load_ensemble_for_inference(
parsed_args.path.split(':'), task, model_arg_overrides=eval(parsed_args.model_overrides),
)

for arg in vars(parsed_args).keys():
if arg not in {'self_target', 'future_target', 'past_target', 'tokens_per_sample', 'output_size_dictionary'}:
@@ -83,9 +85,10 @@ def main(parsed_args):
max_positions=utils.resolve_max_positions(*[
model.max_positions() for model in models
]),
ignore_invalid_inputs=True,
num_shards=args.num_shards,
shard_id=args.shard_id,
ignore_invalid_inputs=True,
num_workers=args.num_workers,
).next_epoch_itr(shuffle=False)

gen_timer = StopwatchMeter()
3 changes: 1 addition & 2 deletions fairseq/data/__init__.py
@@ -9,7 +9,7 @@
from .fairseq_dataset import FairseqDataset
from .backtranslation_dataset import BacktranslationDataset
from .concat_dataset import ConcatDataset
from .indexed_dataset import IndexedDataset, IndexedCachedDataset, IndexedInMemoryDataset, IndexedRawTextDataset
from .indexed_dataset import IndexedCachedDataset, IndexedDataset, IndexedRawTextDataset
from .language_pair_dataset import LanguagePairDataset
from .monolingual_dataset import MonolingualDataset
from .round_robin_zip_datasets import RoundRobinZipDatasets
@@ -33,7 +33,6 @@
'GroupedIterator',
'IndexedCachedDataset',
'IndexedDataset',
'IndexedInMemoryDataset',
'IndexedRawTextDataset',
'LanguagePairDataset',
'MonolingualDataset',
48 changes: 24 additions & 24 deletions fairseq/data/backtranslation_dataset.py
@@ -56,6 +56,28 @@ def update_sample(sample, generated_source):


class BacktranslationDataset(FairseqDataset):
"""
Sets up a backtranslation dataset which takes a tgt batch, generates
a src using a tgt-src backtranslation function (*backtranslation_fn*),
and returns the corresponding `{generated src, input tgt}` batch.
Args:
tgt_dataset (~fairseq.data.FairseqDataset): the dataset to be
backtranslated. Only the source side of this dataset will be used.
After backtranslation, the source sentences in this dataset will be
returned as the targets.
backtranslation_fn (callable): function to call to generate
backtranslations. This is typically the `generate` method of a
:class:`~fairseq.sequence_generator.SequenceGenerator` object.
max_len_a, max_len_b (int, int): will be used to compute
`maxlen = max_len_a * src_len + max_len_b`, which will be passed
into *backtranslation_fn*.
output_collater (callable, optional): function to call on the
backtranslated samples to create the final batch
(default: ``tgt_dataset.collater``).
cuda: use GPU for generation
"""

def __init__(
self,
tgt_dataset,
@@ -66,27 +88,6 @@ def __init__(
cuda=True,
**kwargs
):
"""
Sets up a backtranslation dataset which takes a tgt batch, generates
a src using a tgt-src backtranslation function (*backtranslation_fn*),
and returns the corresponding `{generated src, input tgt}` batch.
Args:
tgt_dataset (~fairseq.data.FairseqDataset): the dataset to be
backtranslated. Only the source side of this dataset will be
used. After backtranslation, the source sentences in this
dataset will be returned as the targets.
backtranslation_fn (callable): function to call to generate
backtranslations. This is typically the `generate` method of a
:class:`~fairseq.sequence_generator.SequenceGenerator` object.
max_len_a, max_len_b (int, int): will be used to compute
`maxlen = max_len_a * src_len + max_len_b`, which will be
passed into *backtranslation_fn*.
output_collater (callable, optional): function to call on the
backtranslated samples to create the final batch (default:
``tgt_dataset.collater``)
cuda: use GPU for generation
"""
self.tgt_dataset = tgt_dataset
self.backtranslation_fn = backtranslation_fn
self.max_len_a = max_len_a
@@ -166,11 +167,10 @@ def size(self, index):
"""
tgt_size = self.tgt_dataset.size(index)[0]
return (tgt_size, tgt_size)

@property
def supports_prefetch(self):
return self.tgt_dataset.supports_prefetch()
return getattr(self.tgt_dataset, 'supports_prefetch', False)

def prefetch(self, indices):
return self.tgt_dataset.prefetch(indices)
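
For context, a rough usage sketch of the class documented above (not part of this commit): a backward model's generator is wired in as ``backtranslation_fn``, as the docstring suggests. The ``SequenceGenerator`` constructor arguments and the exact ``BacktranslationDataset`` keywords are assumptions based on the docstring and may differ across versions::

    def make_backtranslation_dataset(mono_tgt_dataset, backward_model, tgt_dict):
        """Wrap a monolingual target-side dataset for on-the-fly backtranslation."""
        from fairseq.data import BacktranslationDataset
        from fairseq.sequence_generator import SequenceGenerator

        # Greedy (beam=1) backward generator; its `generate` method becomes the
        # dataset's backtranslation_fn, as described in the docstring above.
        generator = SequenceGenerator([backward_model], tgt_dict, beam_size=1)

        return BacktranslationDataset(
            tgt_dataset=mono_tgt_dataset,
            backtranslation_fn=generator.generate,
            max_len_a=1,    # maxlen = 1 * src_len + 10
            max_len_b=10,
            output_collater=mono_tgt_dataset.collater,
            cuda=True,
        )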

