Code refactor in RNNs (d2l-ai#1897)
* lm dataset + text process

* sequence

* rnn

* stack

* stack

* update lib

* add plot to class Module

* rnn scratch

* lib

* rnn

* update rnn

* update lib

* fix

* fix

* fix
mli authored Aug 26, 2021
1 parent 63cafb0 commit f3ec8f2
Showing 21 changed files with 1,689 additions and 2,158 deletions.
55 changes: 53 additions & 2 deletions chapter_appendix-tools-for-deep-learning/utils.md
@@ -47,7 +47,7 @@ def save_hyperparameters(self, ignore=[]):
    frame = inspect.currentframe().f_back
    _, _, _, local_vars = inspect.getargvalues(frame)
    self.hparams = {k:v for k, v in local_vars.items()
-                   if k not in set(ignore+['self'])}
+                   if k not in set(ignore+['self']) and not k.startswith('_')}
    for k, v in self.hparams.items():
        setattr(self, k, v)
```
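As a usage sketch (the class and argument names below are hypothetical; it assumes the `d2l.HyperParameters` base class shown above is importable), the hook copies constructor arguments onto the instance while skipping ignored and underscore-prefixed names:

```python
from d2l import torch as d2l  # any of the framework variants carries this class

class Model(d2l.HyperParameters):  # hypothetical example class
    def __init__(self, lr, batch_size=32, _cache=None):
        self.save_hyperparameters()  # records lr and batch_size on self

m = Model(lr=0.1)
print(m.lr, m.batch_size)    # 0.1 32
print(hasattr(m, '_cache'))  # False: underscore-prefixed names are filtered out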
@@ -187,6 +187,17 @@ def train_ch6(net, train_iter, test_iter, num_epochs, lr, device):
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')

def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, gluon.Block):
        params = [p.data() for p in net.collect_params().values()]
    else:
        params = net.params
    norm = math.sqrt(sum((p.grad ** 2).sum() for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```

@@ -492,7 +503,6 @@ def accuracy(y_hat, y): #@save
```{.python .input}
%%tab all
import os
import requests
import zipfile
@@ -557,6 +567,11 @@ def download_extract(name, folder=None): #@save
    return os.path.join(base_dir, folder) if folder else data_dir

def tokenize(lines, token='word'):  #@save
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]
```
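Since `tokenize` is pure Python, a quick sanity check needs nothing beyond the function itself (the toy strings are our own):

```python
def tokenize(lines, token='word'):  # copied from the block above
    """Split text lines into word or character tokens."""
    assert token in ('word', 'char'), 'Unknown token type: ' + token
    return [line.split() if token == 'word' else list(line) for line in lines]

print(tokenize(['the time machine'], token='word'))  # [['the', 'time', 'machine']]
print(tokenize(['abc'], token='char'))               # [['a', 'b', 'c']]
```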

```{.python .input}
@@ -583,3 +598,39 @@ def evaluate_loss(net, data_iter, loss): #@save
        metric.add(d2l.reduce_sum(l), d2l.size(l))
    return metric[0] / metric[1]
```

```{.python .input}
#@tab pytorch
def grad_clipping(net, theta):  #@save
    """Clip the gradient."""
    if isinstance(net, nn.Module):
        params = [p for p in net.parameters() if p.requires_grad]
    else:
        params = net.params
    norm = torch.sqrt(sum(torch.sum((p.grad ** 2)) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm
```
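A minimal usage sketch for the PyTorch variant (the toy model and data are our own, not from the book); the essential point is that clipping runs after `backward()` and before the optimizer step:

```python
import torch
from torch import nn

net = nn.Linear(4, 1)  # toy model (hypothetical)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(8, 4), torch.randn(8, 1)

opt.zero_grad()
loss = nn.functional.mse_loss(net(X), y)
loss.backward()
grad_clipping(net, 1)  # rescales gradients so their global norm is at most 1
opt.step()
```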

```{.python .input}
#@tab tensorflow
def grad_clipping(grads, theta):  #@save
    """Clip the gradient."""
    theta = tf.constant(theta, dtype=tf.float32)
    new_grad = []
    for grad in grads:
        if isinstance(grad, tf.IndexedSlices):
            new_grad.append(tf.convert_to_tensor(grad))
        else:
            new_grad.append(grad)
    norm = tf.math.sqrt(sum((tf.reduce_sum(grad ** 2)).numpy()
                            for grad in new_grad))
    norm = tf.cast(norm, tf.float32)
    if tf.greater(norm, theta):
        for i, grad in enumerate(new_grad):
            new_grad[i] = grad * theta / norm
    return new_grad
```
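And the matching sketch for the TensorFlow variant, which takes the gradient list rather than the model (again a toy setup of our own):

```python
import tensorflow as tf

net = tf.keras.layers.Dense(1)  # toy model (hypothetical)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
X, y = tf.random.normal((8, 4)), tf.random.normal((8, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((net(X) - y) ** 2)
grads = tape.gradient(loss, net.trainable_variables)
grads = grad_clipping(grads, 1)  # clip by global norm before applying
opt.apply_gradients(zip(grads, net.trainable_variables))
```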
4 changes: 2 additions & 2 deletions chapter_deep-learning-computation/use-gpu.md
@@ -480,7 +480,7 @@ Let the trainer support GPUs.
```{.python .input}
%%tab mxnet, pytorch
@d2l.add_to_class(d2l.Trainer) #@save
-def __init__(self, max_epochs, num_gpus=0):
+def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
    self.save_hyperparameters()
    self.gpus = [d2l.gpu(i) for i in range(min(num_gpus, d2l.num_gpus()))]
@@ -498,7 +498,7 @@ def prepare_model(self, model):
    if tab.selected('mxnet'):
        model.collect_params().reset_ctx(self.gpus[0])
    if tab.selected('pytorch'):
-        model.net.to(self.gpus[0])
+        model.to(self.gpus[0])
    self.model = model
```
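With these two changes, the intended call pattern looks roughly like this (`model` and `data` stand for any `d2l.Module` / `d2l.DataModule` pair; the exact `fit` signature is our reading of the book's API, not something this diff shows):

```python
trainer = d2l.Trainer(max_epochs=10, num_gpus=1, gradient_clip_val=1)
trainer.fit(model, data)  # prepare_model() moves the model to self.gpus[0]
```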

69 changes: 40 additions & 29 deletions chapter_linear-networks/api.md
@@ -6,14 +6,14 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
# The D2L APIs
:label:`sec_d2l_apis`

Linear regression is one of the simplest machine learning models. Training it, however, uses many of the same components that other models in this book require. Therefore, before diving into the details it is worth reviewing some of the functionality of the D2L library used throughout this book. This will greatly streamline the presentation and you might even want to use it in your own projects.

At its core we have three classes: `Module` contains models, losses and optimization methods; `DataModule` provides data loaders for training and validation. The `Trainer` class combines the two and allows us to train models on a variety of hardware platforms. Most code in this book adapts `Module` and `DataModule`. We will touch upon the `Trainer` class only when we discuss GPUs, CPUs, parallel training, and optimization algorithms.

```{.python .input}
%%tab mxnet
@@ -42,8 +42,8 @@ import tensorflow as tf

## Utilities

We need a few utilities to simplify object-oriented programming in notebooks. One of the challenges is that class definitions tend to be fairly long blocks of code. Notebook readability demands short code fragments, interspersed with explanations, a requirement incompatible with the style of programming common for Python libraries. The first utility function allows us to register functions as methods in a class *after* the class has been created. In fact, we can do so *even after* we've created instances of the class! It allows us to split the implementation of a class into multiple code blocks.

```{.python .input}
%%tab all
@@ -75,7 +75,7 @@ def do(self):
a.do()
```
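For reference, `add_to_class` can be pictured as a one-line decorator factory; this is our sketch of the idea rather than the library's exact source:

```python
def add_to_class(Class):
    """Register functions as methods of Class after Class has been defined."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper

class A:
    def __init__(self):
        self.b = 1

a = A()

@add_to_class(A)
def do(self):
    print('Class attribute "b" is', self.b)

a.do()  # works even though do() was attached after a was created
```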

The second one is a utility class that saves all arguments in a class's `__init__` method as class attributes. This allows us to extend constructor call signatures implicitly without additional code.

```{.python .input}
%%tab all
@@ -97,7 +97,7 @@ class B(d2l.HyperParameters):  # call the one saved in d2l with code implementation
B(a=1, b=2, c=3);
```

The last utility allows us to plot experiment progress interactively while it is going on. In deference to the much more powerful (and complex) [TensorBoard](https://www.tensorflow.org/tensorboard) we name it `ProgressBoard`. The implementation is deferred to :numref:`sec_utils`. For now, let's simply see it in action.

The `draw` function plots a point `(x, y)` in the figure, with `label` specifying its legend entry. The optional `every_n` smooths the line by showing only $1/n$ of the points in the figure; each plotted value is averaged over the $n$ neighboring points in the original sequence.
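The smoothing itself can be pictured with a tiny standalone helper (our own sketch, not the `ProgressBoard` implementation, which accumulates points incrementally):

```python
def every_n_mean(ys, n):
    """Average consecutive groups of n values, dropping an incomplete tail."""
    return [sum(ys[i:i + n]) / n for i in range(0, len(ys) - n + 1, n)]

print(every_n_mean([1, 2, 3, 4, 5, 6], 2))  # [1.5, 3.5, 5.5]
```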

@@ -133,42 +133,52 @@ Sometimes we put the code to compute the outputs into a separate `forward` method.
```{.python .input}
%%tab all
class Module(d2l.nn_Module, d2l.HyperParameters):  #@save
-    def __init__(self):
+    def __init__(self, plot_train_per_epoch=5, plot_valid_per_epoch=1):
        super().__init__()
        self.save_hyperparameters()
        self.board = ProgressBoard()
        if tab.selected('tensorflow'):
            self.training = None

    def loss(self, y_hat, y):
        raise NotImplementedError

    def forward(self, X):
        assert hasattr(self, 'net'), 'Neural network is not defined'
        return self.net(X)

    if tab.selected('tensorflow'):
-        def call(self, X, training=None):
+        def call(self, X, *args, training=None):
            if training is not None:
                self.training = training
-            return self.forward(X)
+            return self.forward(X, *args)

+    def plot(self, key, value, train):
+        """Plot a point in animation."""
+        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
+        self.board.xlabel = 'epoch'
+        if train:
+            x = self.trainer.train_batch_idx / \
+                self.trainer.num_train_batches
+            n = self.trainer.num_train_batches / \
+                self.plot_train_per_epoch
+        else:
+            x = self.trainer.epoch + 1
+            n = self.trainer.num_val_batches / \
+                self.plot_valid_per_epoch
+        self.board.draw(x, value, ('train_' if train else 'val_') + key,
+                        every_n=int(n))

    def training_step(self, batch):
        X, y = batch
        l = self.loss(self(X), y)
-        # Draw progress
-        assert hasattr(self, 'trainer'), 'Trainer is not initialized'
-        num_train = self.trainer.num_train_batches
-        self.board.xlabel = 'epoch'
-        self.board.draw(self.trainer.train_batch_idx / num_train, l,
-                        'train_loss', every_n=num_train // 5)
+        self.plot('loss', l, train=True)
        return l

    def validation_step(self, batch):
        X, y = batch
        l = self.loss(self(X), y)
-        # Draw progress
-        self.board.draw(self.trainer.epoch+1, l, 'val_loss',
-                        every_n=self.trainer.num_val_batches)
+        self.plot('loss', l, train=False)

    def configure_optimizers(self):
        raise NotImplementedError
@@ -199,14 +209,14 @@ class DataModule(d2l.HyperParameters): #@save
    if tab.selected('mxnet', 'pytorch'):
        def __init__(self, root='../data', num_workers=4):
            self.save_hyperparameters()

    if tab.selected('tensorflow'):
        def __init__(self, root='../data'):
            self.save_hyperparameters()

    def get_dataloader(self, train):
        raise NotImplementedError

    def train_dataloader(self):
        return self.get_dataloader(train=True)
@@ -221,7 +231,7 @@ The `Trainer` class trains the learnable parameters (aka weights) in the `Module`
```{.python .input}
%%tab all
class Trainer(d2l.HyperParameters):  #@save
-    def __init__(self, max_epochs, num_gpus=0):
+    def __init__(self, max_epochs, num_gpus=0, gradient_clip_val=0):
        self.save_hyperparameters()
        assert num_gpus == 0, 'No GPU support yet'
@@ -249,11 +259,12 @@ class Trainer(d2l.HyperParameters): #@save
    def fit_epoch(self):
        raise NotImplementedError
```
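Putting the three classes together, training takes a few lines; the subclass names below are placeholders for whatever `Module` and `DataModule` implementations a chapter defines:

```python
data = MyDataModule()       # a d2l.DataModule subclass (hypothetical)
model = MyModel(lr=0.01)    # a d2l.Module subclass (hypothetical)
trainer = d2l.Trainer(max_epochs=3)
trainer.fit(model, data)    # runs fit_epoch() max_epochs times
```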

## Summary

The classes provided by the D2L API function as a lightweight toolkit that makes structured modeling for deep learning easy. In particular, it makes it easy to reuse many components between projects without changing much at all. For instance, we can replace just the optimizer, just the model, just the dataset, and so on. This degree of modularity pays dividends throughout the book in terms of conciseness and simplicity (this is why we added it) and it can do the same for your own projects. We strongly recommend that you look at the implementation in detail once you have gained some more familiarity with deep learning modeling.

```{.python .input}
46 changes: 32 additions & 14 deletions chapter_linear-networks/classification.md
@@ -30,28 +30,26 @@ from IPython import display

## The `Classification` Class

We define the `Classification` class below. In the `validation_step` we report both the loss value and the classification accuracy on a validation batch. We draw an update for every `num_val_batches` batches. This has the benefit of generating the averaged loss and accuracy on the whole validation data. These average numbers are not exactly correct if the last batch contains fewer examples, but we ignore this minor difference to keep the code simple.

```{.python .input}
%%tab all
class Classification(d2l.Module):  #@save
    def validation_step(self, batch):
        X, y = batch
        y_hat = self(X)
-        for k, v in (('val_loss', self.loss(y_hat, y)),
-                     ('val_acc', self.accuracy(y_hat, y))):
-            self.board.draw(self.trainer.epoch+1, v, k,
-                            every_n=self.trainer.num_val_batches)
+        self.plot('loss', self.loss(y_hat, y), train=False)
+        self.plot('acc', self.accuracy(y_hat, y), train=False)
```
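The caveat about the last batch can be made concrete with toy numbers: averaging per-batch means over-weights a small final batch relative to the true per-example mean.

```python
losses = [1.0] * 100 + [5.0] * 4                       # 104 per-example losses
batches = [losses[:50], losses[50:100], losses[100:]]  # final batch has 4 examples
batch_means = [sum(b) / len(b) for b in batches]       # [1.0, 1.0, 5.0]
print(sum(batch_means) / len(batch_means))             # 2.33..., what gets plotted
print(sum(losses) / len(losses))                       # 1.15..., the exact mean
```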

By default we use a Stochastic Gradient Descent optimizer, operating on minibatches, just as we did in the context of linear regression.

```{.python .input}
%%tab mxnet
@d2l.add_to_class(d2l.Module) #@save
def configure_optimizers(self):
-    params = self.collect_params()
-    if isinstance(params, (tuple, list)):
+    params = self.parameters()
+    if isinstance(params, list):
        return d2l.SGD(params, self.lr)
    return gluon.Trainer(params, 'sgd', {'learning_rate': self.lr})
```
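The `d2l.SGD` fallback used for from-scratch parameters can be pictured as a minimal update loop; this sketch assumes each parameter carries its gradient in a `.grad` attribute, as the book's scratch implementations arrange, and is not the library's exact source:

```python
class SGD:
    """Minibatch stochastic gradient descent (framework-agnostic sketch)."""
    def __init__(self, params, lr):
        self.params, self.lr = params, lr

    def step(self):
        for param in self.params:
            param -= self.lr * param.grad  # in-place update of each parameter
```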
@@ -83,7 +81,7 @@ but at the end of the day it has to choose one among the classes.
When predictions are consistent with the label class `y`, they are correct.
The classification accuracy is the fraction of all predictions that are correct.
Although it can be difficult to optimize accuracy directly (it is not differentiable),
it is often the performance measure that we care about the most. It is often *the*
relevant quantity in benchmarks. As such, we will nearly always report it when training classifiers.

Accuracy is computed as follows:
@@ -107,17 +105,37 @@ def accuracy(self, y_hat, y):
    return d2l.reduce_mean(d2l.astype(cmp, d2l.float32))
```
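The same computation in plain PyTorch, with toy tensors to check the behavior (the book's version dispatches through `d2l` ops so it works across frameworks):

```python
import torch

def accuracy(y_hat, y):
    """Fraction of predictions whose argmax matches the label."""
    preds = y_hat.argmax(dim=1).type(y.dtype)  # pick the largest logit per row
    return (preds == y).float().mean()

y_hat = torch.tensor([[0.1, 0.9], [0.8, 0.2]])
y = torch.tensor([1, 1])
print(accuracy(y_hat, y))  # tensor(0.5000)
```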

-## Summary and Discussion
```{.python .input}
%%tab mxnet
@d2l.add_to_class(d2l.Module)  #@save
def get_scratch_params(self):
    params = []
    for attr in dir(self):
        a = getattr(self, attr)
        if isinstance(a, np.ndarray):
            params.append(a)
        if isinstance(a, d2l.Module):
            params.extend(a.get_scratch_params())
    return params

@d2l.add_to_class(d2l.Module)  #@save
def parameters(self):
    params = self.collect_params()
    return params if len(params.keys()) else self.get_scratch_params()
```

## Summary

Classification is a sufficiently common problem type that it warrants its own convenience functions. Note that there is a difference between the (classification) accuracy that we want to maximize and the logistic loss function that we are actually minimizing. Fortunately, our specific choice of loss function ensures that minimizing it will also lead to maximum accuracy. This is the case since the maximum likelihood estimator is consistent. It follows as a special case of the Cramér-Rao bound :cite:`cramer1946mathematical,radhakrishna1945information`. For more work on consistency see also :cite:`zhang2004statistical`.

More generally, though, the decision of which category to pick is far from trivial. For instance, when deciding which folder to assign an e-mail to, mistaking a "Primary" e-mail for a "Social" one might be undesirable but far less disastrous than moving it to the spam folder (and later automatically deleting it). As such, we will tend to err on the side of caution with regard to assigning any e-mail to the "Spam" folder, rather than simply picking the most likely category.

## Exercises

1. Denote by $L_v$ the validation loss, and let $L_v^q$ be its quick and dirty estimate computed by the loss function averaging in this section. Lastly, denote by $l_v^b$ the loss on the last minibatch. Express $L_v$ in terms of $L_v^q$, $l_v^b$, and the sample and minibatch sizes.
1. Show that the quick and dirty estimate $L_v^q$ is unbiased. That is, show that $E[L_v] = E[L_v^q]$. Why would you still want to use $L_v$ instead?
1. Given a multiclass classification loss, denoting by $l(y,y')$ the penalty of estimating $y'$ when we see $y$ and given a probability $p(y|x)$, formulate the rule for an optimal selection of $y'$. Hint: express the expected loss, using $l$ and $p(y|x)$.
