Code refactor in RNNs (d2l-ai#1897)
* lm dataset + text process

* sequence

* rnn

* stack

* stack

* update lib

* add plot to class Module

* rnn scratch

* lib

* rnn

* update rnn

* update lib

* fix

* fix

* fix
mli authored Aug 26, 2021
1 parent 63cafb0 commit f3ec8f2
Showing 21 changed files with 1,689 additions and 2,158 deletions.
55 changes: 53 additions & 2 deletions chapter_appendix-tools-for-deep-learning/utils.md
@@ -47,7 +47,7 @@ def save_hyperparameters(self, ignore=[]):
    frame = inspect.currentframe().f_back
    _, _, _, local_vars = inspect.getargvalues(frame)
    self.hparams = {k:v for k, v in local_vars.items()
-                   if k not in set(ignore+['self'])}
+                   if k not in set(ignore+['self']) and not k.startswith('_')}
    for k, v in self.hparams.items():
        setattr(self, k, v)
```
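As a usage sketch (the class and argument names below are hypothetical; it assumes the `d2l.HyperParameters` base class shown above is importable), the hook copies constructor arguments onto the instance while skipping ignored and underscore-prefixed names:

```python
from d2l import torch as d2l  # any of the framework variants carries this class

class Model(d2l.HyperParameters):  # hypothetical example class
    def __init__(self, lr, batch_size=32, _cache=None):
        self.save_hyperparameters()  # records lr and batch_size on self

m = Model(lr=0.1)
print(m.lr, m.batch_size)    # 0.1 32
print(hasattr(m, '_cache'))  # False: underscore-prefixed names are filtered out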
@@ -187,6 +187,17 @@ def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, gluon.Block):
        params = [p.data() for p in net.collect_params().values()]
    else:
        params = net.params
    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```

@@ -492,7 +503,6 @@ def accuracy(y_hat, y): #@save
```{.python .input}
%%tab all
import os
import requests
import zipfile
@@ -557,6 +567,11 @@ def download_extract(name, folder=None): #@save
    return os.path.join(base_dir, folder) if folder else data_dir

def tokenize(lines, token='word'):  #@save
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]
```
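Since `tokenize` is pure Python, a quick sanity check needs nothing beyond the function itself (the toy strings are our own):

```python
def tokenize(lines, token='word'):  # copied from the block above
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]

print(tokenize(['the time machine'], token='word'))  # [['the', 'time', 'machine']]
print(tokenize(['abc'], token='char'))               # [['a', 'b', 'c']]
```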

```{.python .input}
@@ -583,3 +598,39 @@ def evaluate_loss(net, data_iter, loss): #@save
        metric.add(d2l.reduce_sum(l), d2l.size(l))
    return metric[0] / metric[1]
```

```{.python .input}
#@tab pytorch
def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```
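A minimal usage sketch for the PyTorch variant (the toy model and data are our own, not from the book); the essential point is that clipping runs after `backward()` and before the optimizer step:

```python
import torch
from torch import nn

net = nn.Linear(4, 1)  # toy model (hypothetical)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(net(X), y)
loss.backward()
grad_clipping(net, 1)  # rescales gradients so their global norm is at most 1
opt.step()
```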

```{.python .input}
#@tab tensorflow
def grad_clipping(grads, theta):  #@save
    """Clip the gradient."""
    theta = tf.constant(theta, dtype=tf.float32)
    new_grad = []
    for grad in grads:
        if isinstance(grad, tf.IndexedSlices):
            new_grad.append(tf.convert_to_tensor(grad))
        else:
            new_grad.append(grad)
    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
                            for grad in new_grad))
    norm = tf.cast(norm, tf.float32)
    if tf.greater(norm, theta):
        for i, grad in enumerate(new_grad):
            new_grad[i] = grad * theta / norm
    return new_grad
```
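And the matching sketch for the TensorFlow variant, which takes the gradient list rather than the model (again a toy setup of our own):

```python
import tensorflow as tf

net = tf.keras.layers.Dense(1)  # toy model (hypothetical)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
X, y = tf.random.normal((8, 4)), tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((net(X) - y) ** 2)
grads = tape.gradient(loss, net.trainable_variables)
grads = grad_clipping(grads, 1)  # clip by global norm before applying
opt.apply_gradients(zip(grads, net.trainable_variables))
```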
4 changes: 2 additions & 2 deletions chapter_deep-learning-computation/use-gpu.md
@@ -480,7 +480,7 @@ Let the trainer support GPUs.
```{.python .input}
%%tab mxnet, pytorch
@d2l.add_to_class(d2l.Trainer) #@save
-def __init__(self, max_epochs, num_gpus=0):
+def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
@@ -498,7 +498,7 @@ def prepare_model(self, model):
    if tab.selected('mxnet'):
        model.collect_params().reset_ctx(self.gpus[0])
    if tab.selected('pytorch'):
-        model.net.to(self.gpus[0])
+        model.to(self.gpus[0])
    self.model = model
```
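With these two changes, the intended call pattern looks roughly like this (`model` and `data` stand for any `d2l.Module` / `d2l.DataModule` pair; the exact `fit` signature is our reading of the book's API, not something this diff shows):

```python
trainer = d2l.Trainer(max_epochs=10, num_gpus=1, gradient_clip_val=1)
trainer.fit(model, data)  # prepare_model() moves the model to self.gpus[0]
```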

69 changes: 40 additions & 29 deletions chapter_linear-networks/api.md
@@ -6,14 +6,14 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
# The D2L APIs
:label:`sec_d2l_apis`

Linear regression is one of the simplest machine learning models. Training it, however, uses many of the same components that other models in this book require. Therefore, before diving into the details it is worth reviewing some of the functionality of the D2L library used throughout this book. This will greatly streamline the presentation and you might even want to use it in your own projects.

At its core we have three classes: `Module` contains models, losses and optimization methods; `DataModule` provides data loaders for training and validation. The `Trainer` class combines the two and allows us to train models on a variety of hardware platforms. Most code in this book adapts `Module` and `DataModule`. We will touch upon the `Trainer` class only when we discuss GPUs, CPUs, parallel training, and optimization algorithms.

```{.python .input}
%%tab mxnet
@@ -42,8 +42,8 @@ import tensorflow as tf

## Utilities

We need a few utilities to simplify object-oriented programming in notebooks. One of the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability demands short code fragments, interspersed with explanations, a requirement incompatible with the style of programming common for Python libraries. The first utility function allows us to register functions as methods in a class *after* the class has been created. In fact, we can do so *even after* we've created instances of the class! It allows us to split the implementation of a class into multiple code blocks.

```{.python .input}
%%tab all
@@ -75,7 +75,7 @@ def do(self):
a.do()
```
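For reference, `add_to_class` can be pictured as a one-line decorator factory; this is our sketch of the idea rather than the library's exact source:

```python
def add_to_class(Class):
    """Register functions as methods of Class after Class has been defined."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class A:
    def __init__(self):
        self.b = 1

a = A()

@add_to_class(A)
def do(self):
    print('Class attribute "b" is', self.b)

a.do()  # works even though do() was attached after a was created
```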

The second one is a utility class that saves all arguments in a class's `__init__` method as class attributes. This allows us to extend constructor call signatures implicitly without additional code.

```{.python .input}
%%tab all
@@ -97,7 +97,7 @@ class B(d2l.HyperParameters):  # call the one saved in d2l with code implementation
B(a=1, b=2, c=3);
```

The last utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) [TensorBoard](https://www.tensorflow.org/tensorboard) we name it `ProgressBoard`. The implementation is deferred to :numref:`sec_utils`. For now, let's simply see it in action.

The `draw` function plots a point `(x, y)` in the figure, with `label` specifying its legend entry. The optional `every_n` smooths the line by showing only $1/n$ of the points in the figure; each plotted value is averaged over the $n$ neighboring points in the original sequence.
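The smoothing itself can be pictured with a tiny standalone helper (our own sketch, not the `ProgressBoard` implementation, which accumulates points incrementally):

```python
def every_n_mean(ys, n):
    """Average consecutive groups of n values, dropping an incomplete tail."""
    return [sum(ys[i:i + n]) / n for i in range(0, len(ys) - n + 1, n)]

print(every_n_mean([1, 2, 3, 4, 5, 6], 2))  # [1.5, 3.5, 5.5]
```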

@@ -133,42 +133,52 @@ Sometimes we put the code to compute the outputs into a separate `forward` method.
```{.python .input}
%%tab all
class Module(d2l.nn_Module, d2l.HyperParameters):  #@save
-    def __init__(self):
+    def __init__(self, plot_train_per_epoch=5, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = ProgressBoard()
        if tab.selected('tensorflow'):
            self.training = None

    def loss(self, y_hat, y):
        raise NotImplementedError

    def forward(self, X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)

    if tab.selected('tensorflow'):
-        def call(self, X, training=None):
+        def call(self, X, *args, training=None):
            if training is not None:
                self.training = training
-            return self.forward(X)
+            return self.forward(X, *args)

+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        self.board.draw(x, value, ('train_' if train else 'val_') + key,
+                        every_n=int(n))

    def training_step(self, batch):
        X, y = batch
        l = self.loss(self(X), y)
-        # Draw progress
-        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
-        num_train = self.trainer.num_train_batches
-        self.board.xlabel = 'epoch'
-        self.board.draw(self.trainer.train_batch_idx / num_train, l,
-                        'train_loss', every_n=num_train // 5)
+        self.plot('loss', l, train=True)
        return l

    def validation_step(self, batch):
        X, y = batch
        l = self.loss(self(X), y)
-        # Draw progress
-        self.board.draw(self.trainer.epoch+1, l, 'val_loss',
-                        every_n=self.trainer.num_val_batches)
+        self.plot('loss', l, train=False)

    def configure_optimizers(self):
        raise NotImplementedError
@@ -199,14 +209,14 @@ class DataModule(d2l.HyperParameters): #@save
    if tab.selected('mxnet', 'pytorch'):
        def __init__(self, root='../data', num_workers=4):
            self.save_hyperparameters()

    if tab.selected('tensorflow'):
        def __init__(self, root='../data'):
            self.save_hyperparameters()

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)
@@ -221,7 +231,7 @@ The `Trainer` class trains the learnable parameters (aka weights) in the `Module`
```{.python .input}
%%tab all
class Trainer(d2l.HyperParameters):  #@save
-    def __init__(self, max_epochs, num_gpus=0):
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'
@@ -249,11 +259,12 @@ class Trainer(d2l.HyperParameters): #@save
    def fit_epoch(self):
        raise NotImplementedError
```
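Putting the three classes together, training takes a few lines; the subclass names below are placeholders for whatever `Module` and `DataModule` implementations a chapter defines:

```python
data = MyDataModule()       # a d2l.DataModule subclass (hypothetical)
model = MyModel(lr=0.01)    # a d2l.Module subclass (hypothetical)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)    # runs fit_epoch() max_epochs times
```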

## Summary

The classes provided by the D2L API function as a lightweight toolkit that makes structured modeling for deep learning easy. In particular, it makes it easy to reuse many components between projects without changing much at all. For instance, we can replace just the optimizer, just the model, just the dataset, and so on. This degree of modularity pays dividends throughout the book in terms of conciseness and simplicity (this is why we added it) and it can do the same for your own projects. We strongly recommend that you look at the implementation in detail once you have gained some more familiarity with deep learning modeling.

```{.python .input}
46 changes: 32 additions & 14 deletions chapter_linear-networks/classification.md
@@ -30,28 +30,26 @@ from IPython import display

## The `Classification` Class

We define the `Classification` class below. In the `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the last batch contains fewer examples, but we ignore this minor difference to keep the code simple.

```{.python .input}
%%tab all
class Classification(d2l.Module):  #@save
    def validation_step(self, batch):
        X, y = batch
        y_hat = self(X)
-        for k, v in (('val_loss', self.loss(y_hat, y)),
-                     ('val_acc', self.accuracy(y_hat, y))):
-            self.board.draw(self.trainer.epoch+1, v, k,
-                            every_n=self.trainer.num_val_batches)
+        self.plot('loss', self.loss(y_hat, y), train=False)
+        self.plot('acc', self.accuracy(y_hat, y), train=False)
```
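The caveat about the last batch can be made concrete with toy numbers: averaging per-batch means over-weights a small final batch relative to the true per-example mean.

```python
losses = [1.0] * 100 + [5.0] * 4                       # 104 per-example losses
batches = [losses[:50], losses[50:100], losses[100:]]  # final batch has 4 examples
batch_means = [sum(b) / len(b) for b in batches]       # [1.0, 1.0, 5.0]
print(sum(batch_means) / len(batch_means))             # 2.33..., what gets plotted
print(sum(losses) / len(losses))                       # 1.15..., the exact mean
```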

By default we use a Stochastic Gradient Descent optimizer, operating on minibatches, just as we did in the context of linear regression.

```{.python .input}
%%tab mxnet
@d2l.add_to_class(d2l.Module) #@save
def configure_optimizers(self):
-    params = self.collect_params()
-    if isinstance(params, (tuple, list)):
+    params = self.parameters()
+    if isinstance(params, list):
        return d2l.SGD(params, self.lr)
    return gluon.Trainer(params, 'sgd', {'learning_rate': self.lr})
```
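The `d2l.SGD` fallback used for from-scratch parameters can be pictured as a minimal update loop; this sketch assumes each parameter carries its gradient in a `.grad` attribute, as the book's scratch implementations arrange, and is not the library's exact source:

```python
class SGD:
    """Minibatch stochastic gradient descent (framework-agnostic sketch)."""
    def __init__(self, params, lr):
        self.params, self.lr = params, lr

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad  # in-place update of each parameter
```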
@@ -83,7 +81,7 @@ but at the end of the day it has to choose one among the classes.
When predictions are consistent with the label class `y`, they are correct.
The classification accuracy is the fraction of all predictions that are correct.
Although it can be difficult to optimize accuracy directly (it is not differentiable),
it is often the performance measure that we care about the most. It is often *the*
relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.

Accuracy is computed as follows:
@@ -107,17 +105,37 @@ def accuracy(self, y_hat, y):
    return d2l.reduce_mean(d2l.astype(cmp, d2l.float32))
```
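The same computation in plain PyTorch, with toy tensors to check the behavior (the book's version dispatches through `d2l` ops so it works across frameworks):

```python
import torch

def accuracy(y_hat, y):
    """Fraction of predictions whose argmax matches the label."""
    preds = y_hat.argmax(dim=1).type(y.dtype)  # pick the largest logit per row
    return (preds == y).float().mean()

y_hat = torch.tensor([[0.1, 0.9], [0.8, 0.2]])
y = torch.tensor([1, 1])
print(accuracy(y_hat, y))  # tensor(0.5000)
```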

-## Summary and Discussion
```{.python .input}
%%tab mxnet
@d2l.add_to_class(d2l.Module)  #@save
def get_scratch_params(self):
    params = []
    for attr in dir(self):
        a = getattr(self, attr)
        if isinstance(a, np.ndarray):
            params.append(a)
        if isinstance(a, d2l.Module):
            params.extend(a.get_scratch_params())
    return params

@d2l.add_to_class(d2l.Module)  #@save
def parameters(self):
    params = self.collect_params()
    return params if len(params.keys()) else self.get_scratch_params()
```

## Summary

Classification is a sufficiently common problem type that it warrants its own convenience functions. Note that there is a difference between the (classification) accuracy that we want to maximize and the logistic loss function that we are actually minimizing. Fortunately, our specific choice of loss function ensures that minimizing it will also lead to maximum accuracy. This is the case since the maximum likelihood estimator is consistent. It follows as a special case of the Cramér-Rao bound :cite:`cramer1946mathematical,radhakrishna1945information`. For more work on consistency see also :cite:`zhang2004statistical`.

More generally, though, the decision of which category to pick is far from trivial. For instance, when deciding which folder to assign an e-mail to, mistaking a "Primary" e-mail for a "Social" one might be undesirable but far less disastrous than moving it to the spam folder (and later automatically deleting it). As such, we will tend to err on the side of caution with regard to assigning any e-mail to the "Spam" folder, rather than simply picking the most likely category.

## Exercises

1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.
1. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that $E[L_v] = E[L_v^q]$. Why would you still want to use $L_v$ instead?
1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y|x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y|x)$.
