:label:sec_bi_rnn
In sequence learning, so far we assumed that our goal is to model the next output given what we have seen so far, e.g., in the context of a time series or in the context of a language model. While this is a typical scenario, it is not the only one we might encounter. To illustrate the issue, consider the following three tasks of filling in the blank in a text sequence:
- I am ___.
- I am ___ hungry.
- I am ___ hungry, and I can eat half a pig.
Depending on the amount of information available, we might fill in the blanks with very different words such as "happy", "not", and "very". Clearly the end of the phrase (if available) conveys significant information about which word to pick. A sequence model that is incapable of taking advantage of this will perform poorly on related tasks. For instance, to do well in named entity recognition (e.g., to recognize whether "Green" refers to "Mr. Green" or to the color) longer-range context is equally vital. To get some inspiration for addressing the problem let us take a detour to probabilistic graphical models.
Dynamic Programming in Hidden Markov Models
This subsection serves to illustrate the dynamic programming problem. The specific technical details do not matter for understanding the deep learning models but they help in motivating why one might use deep learning and why one might pick specific architectures.
If we want to solve the problem using probabilistic graphical models we could for instance design a latent variable model as follows.
At any time step $t$, we assume that there exists some latent variable $h_t$ that governs our observed emission $x_t$ via $P(x_t \mid h_t)$. Moreover, any transition $h_t \to h_{t+1}$ is given by some state transition probability $P(h_{t+1} \mid h_t)$. This probabilistic graphical model is a hidden Markov model, as in :numref:fig_hmm.

Thus, for a sequence of $T$ observations we have the following joint probability distribution over the observed and hidden states:

$$P(x_1, \ldots, x_T, h_1, \ldots, h_T) = \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t), \text{ where } P(h_1 \mid h_0) = P(h_1).$$
:eqlabel:eq_hmm_jointP
Now assume that we observe all $x_i$ except for some $x_j$, and that our goal is to compute $P(x_j \mid x_{-j})$, where $x_{-j} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_T)$. Summing the joint distribution over all combinations of the latent variables $h_1, \ldots, h_T$ directly requires a number of terms that grows exponentially in $T$; fortunately, dynamic programming makes the computation tractable.

To see how it works, consider summing over the latent variables $h_1, \ldots, h_T$ one at a time. According to :eqref:eq_hmm_jointP, this yields:
$$\begin{aligned}
&P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_2, \ldots, h_T} \underbrace{\left[\sum_{h_1} P(h_1) P(x_1 \mid h_1) P(h_2 \mid h_1)\right]}_{\pi_2(h_2) \stackrel{\mathrm{def}}{=}} P(x_2 \mid h_2) \prod_{t=3}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \sum_{h_3, \ldots, h_T} \underbrace{\left[\sum_{h_2} \pi_2(h_2) P(x_2 \mid h_2) P(h_3 \mid h_2)\right]}_{\pi_3(h_3) \stackrel{\mathrm{def}}{=}} P(x_3 \mid h_3) \prod_{t=4}^T P(h_t \mid h_{t-1}) P(x_t \mid h_t) \\
=& \dots \\
=& \sum_{h_T} \pi_T(h_T) P(x_T \mid h_T).
\end{aligned}$$
In general we have the forward recursion as

$$\pi_{t+1}(h_{t+1}) = \sum_{h_t} \pi_t(h_t) P(x_t \mid h_t) P(h_{t+1} \mid h_t).$$

The recursion is initialized as $\pi_1(h_1) = P(h_1)$. In abstract terms it can be written as $\pi_{t+1} = f(\pi_t, x_t)$ for some learnable function $f$, which looks very much like the update equation of an RNN.
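To make the forward recursion concrete, here is a minimal sketch (not part of the original derivation) of how one might compute the messages $\pi_1, \ldots, \pi_T$ for a discrete HMM in PyTorch. The transition matrix `A`, emission matrix `B`, initial distribution `pi1`, observation sequence `x`, and the helper `forward_recursion` are all made-up toy quantities for illustration.

```python
import torch

# Toy HMM: k hidden states, m observation symbols, sequence length T (all made up)
k, m, T = 3, 4, 5
torch.manual_seed(0)
A = torch.softmax(torch.randn(k, k), dim=1)   # A[i, j] = P(h_{t+1}=j | h_t=i)
B = torch.softmax(torch.randn(k, m), dim=1)   # B[i, o] = P(x_t=o | h_t=i)
pi1 = torch.softmax(torch.randn(k), dim=0)    # pi_1(h_1) = P(h_1)
x = torch.randint(0, m, (T,))                 # an observed symbol sequence

def forward_recursion(x, A, B, pi1):
    """Return the forward messages pi_1, ..., pi_T."""
    pis = [pi1]
    for t in range(len(x) - 1):
        # pi_{t+1}(h_{t+1}) = sum_{h_t} pi_t(h_t) P(x_t | h_t) P(h_{t+1} | h_t)
        pis.append((pis[-1] * B[:, x[t]]) @ A)
    return pis

pis = forward_recursion(x, A, B, pi1)
# Likelihood of the sequence: P(x_1, ..., x_T) = sum_{h_T} pi_T(h_T) P(x_T | h_T)
print((pis[-1] * B[:, x[-1]]).sum())
```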
Entirely analogously to the forward recursion, we can also sum over the same set of latent variables with a backward recursion. This yields:
$$\begin{aligned}
& P(x_1, \ldots, x_T) \\
=& \sum_{h_1, \ldots, h_T} P(x_1, \ldots, x_T, h_1, \ldots, h_T) \\
=& \sum_{h_1, \ldots, h_T} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot P(h_T \mid h_{T-1}) P(x_T \mid h_T) \\
=& \sum_{h_1, \ldots, h_{T-1}} \prod_{t=1}^{T-1} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_T} P(h_T \mid h_{T-1}) P(x_T \mid h_T)\right]}_{\rho_{T-1}(h_{T-1}) \stackrel{\mathrm{def}}{=}} \\
=& \sum_{h_1, \ldots, h_{T-2}} \prod_{t=1}^{T-2} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \cdot \underbrace{\left[\sum_{h_{T-1}} P(h_{T-1} \mid h_{T-2}) P(x_{T-1} \mid h_{T-1}) \rho_{T-1}(h_{T-1}) \right]}_{\rho_{T-2}(h_{T-2}) \stackrel{\mathrm{def}}{=}} \\
=& \ldots \\
=& \sum_{h_1} P(h_1) P(x_1 \mid h_1) \rho_1(h_1).
\end{aligned}$$
We can thus write the backward recursion as

$$\rho_{t-1}(h_{t-1}) = \sum_{h_t} P(h_t \mid h_{t-1}) P(x_t \mid h_t) \rho_t(h_t),$$

with initialization $\rho_T(h_T) = 1$. Both recursions let us sum over $T$ latent variables in time linear rather than exponential in $T$; they are a special instance of the general message passing algorithm :cite:Aji.McEliece.2000.
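Continuing the toy HMM from the sketch above, the backward messages $\rho_1, \ldots, \rho_T$ can be computed in the same spirit. Again, `backward_recursion` is a made-up helper for illustration, not a library function.

```python
def backward_recursion(x, A, B):
    """Return the backward messages rho_1, ..., rho_T."""
    k, T = A.shape[0], len(x)
    rhos = [torch.ones(k)]  # rho_T(h_T) = 1
    for t in range(T - 1, 0, -1):
        # Message at (0-based) position t-1 from the message at position t,
        # weighted by the emission probability of the observation x[t]
        rhos.append(A @ (B[:, x[t]] * rhos[-1]))
    return rhos[::-1]  # reorder as rho_1, ..., rho_T

rhos = backward_recursion(x, A, B)
# The same likelihood, now obtained from the backward pass:
# P(x_1, ..., x_T) = sum_{h_1} P(h_1) P(x_1 | h_1) rho_1(h_1)
print((pi1 * B[:, x[0]] * rhos[0]).sum())
```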
Combining both forward and backward recursions, we are able to compute

$$P(x_j \mid x_{-j}) \propto \sum_{h_j} \pi_j(h_j) \rho_j(h_j) P(x_j \mid h_j).$$

Note that in abstract terms the backward recursion can be written as $\rho_{t-1} = g(\rho_t, x_t)$ for some learnable function $g$. Again, this looks very much like an update equation, only running backwards, unlike what we have seen so far in RNNs. Indeed, hidden Markov models benefit from knowing future data when it is available; see :cite:Doucet.De-Freitas.Gordon.2001 for more details.
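Putting the two toy sketches together, the following hypothetical function estimates the distribution of a single missing observation $x_j$ given all the others, which is exactly the fill-in-the-blank problem from the introduction (the observation stored at that position of `x` does not affect the result).

```python
def filling_in_prob(j, x, A, B, pi1):
    """Distribution of x_j given all other observations (j is a 1-based index)."""
    pis = forward_recursion(x, A, B, pi1)
    rhos = backward_recursion(x, A, B)
    # Unnormalized: sum_{h_j} pi_j(h_j) P(x_j = o | h_j) rho_j(h_j) for each symbol o
    p = (pis[j - 1][:, None] * B * rhos[j - 1][:, None]).sum(dim=0)
    return p / p.sum()

# Probabilities of each symbol for the "blank" at position 3 of the toy sequence
print(filling_in_prob(3, x, A, B, pi1))
```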
If we want a mechanism in RNNs that offers look-ahead ability comparable to that of hidden Markov models, we need to modify the RNN design we have seen so far. Fortunately, this is conceptually easy: instead of running an RNN only in the forward mode starting from the first token, we start another one from the last token, running from back to front.
Bidirectional RNNs add a hidden layer that passes information in a backward direction to more flexibly process such information. :numref:fig_birnn
illustrates the architecture of a bidirectional RNN with a single hidden layer.
In fact, this is not too dissimilar to the forward and backward recursions in the dynamic programming of hidden Markov models. The main distinction is that in the previous case those equations had a specific statistical meaning. Now they are devoid of such easily accessible interpretations and we can just treat them as generic, learnable functions. This transition epitomizes many of the principles guiding the design of modern deep networks: first, use the type of functional dependencies of classical statistical models, and then parameterize them in a generic form.
Bidirectional RNNs were introduced by :cite:Schuster.Paliwal.1997. For a detailed discussion of the various architectures see also :cite:Graves.Schmidhuber.2005.
Let us look at the specifics of such a network.
For any time step $t$, given a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$, number of inputs in each example: $d$), let the hidden layer activation function be $\phi$. In the bidirectional architecture, the forward and backward hidden states for this time step are $\overrightarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$ and $\overleftarrow{\mathbf{H}}_t \in \mathbb{R}^{n \times h}$, respectively, where $h$ is the number of hidden units. They are updated as follows:

$$\begin{aligned} \overrightarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(f)} + \overrightarrow{\mathbf{H}}_{t-1} \mathbf{W}_{hh}^{(f)} + \mathbf{b}_h^{(f)}),\\ \overleftarrow{\mathbf{H}}_t &= \phi(\mathbf{X}_t \mathbf{W}_{xh}^{(b)} + \overleftarrow{\mathbf{H}}_{t+1} \mathbf{W}_{hh}^{(b)} + \mathbf{b}_h^{(b)}), \end{aligned}$$

where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}$, $\mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}$, $\mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}$, and $\mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, together with the biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$, are all model parameters.
Next, we concatenate the forward and backward hidden states $\overrightarrow{\mathbf{H}}_t$ and $\overleftarrow{\mathbf{H}}_t$ to obtain the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times 2h}$ that is fed into the output layer. The output $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$) is computed as

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$

Here, the weight matrix $\mathbf{W}_{hq} \in \mathbb{R}^{2h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer. In fact, the two directions can have different numbers of hidden units.
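For illustration only, the following sketch spells out these equations for a single bidirectional layer in PyTorch with made-up dimensions and random parameters, using $\tanh$ as the activation $\phi$. In practice one would rely on `nn.RNN` or `nn.LSTM` with `bidirectional=True`, as in the training code later in this section.

```python
import torch

# Made-up sizes: n examples, d inputs, h hidden units, q outputs, T time steps
n, d, h, q, T = 2, 5, 4, 3, 6
X = torch.randn(T, n, d)  # a toy input sequence

# Separate parameters for the forward (f) and backward (b) directions
W_xh_f, W_hh_f, b_h_f = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xh_b, W_hh_b, b_h_b = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_hq, b_q = torch.randn(2 * h, q), torch.zeros(q)

H_fwd = [torch.zeros(n, h)]  # forward states, computed left to right
H_bwd = [torch.zeros(n, h)]  # backward states, computed right to left
for t in range(T):
    H_fwd.append(torch.tanh(X[t] @ W_xh_f + H_fwd[-1] @ W_hh_f + b_h_f))
    H_bwd.append(torch.tanh(X[T - 1 - t] @ W_xh_b + H_bwd[-1] @ W_hh_b + b_h_b))

# Concatenate both directions at each time step, then apply the output layer
H = [torch.cat((H_fwd[t + 1], H_bwd[T - t]), dim=1) for t in range(T)]  # each (n, 2h)
O = [H_t @ W_hq + b_q for H_t in H]                                     # each (n, q)
```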
One of the key features of a bidirectional RNN is that information from both ends of the sequence is used to estimate the output. That is, we use information from both future and past observations to predict the current one. In the case of next token prediction this is not quite what we want: we do not have the luxury of knowing the token after the next one when predicting the next one. Hence, if we were to use a bidirectional RNN naively, we would not get very good accuracy: during training we have both past and future data to estimate the present, but at test time we only have past data, and hence poor accuracy. We will illustrate this in an experiment below.
To add insult to injury, bidirectional RNNs are also exceedingly slow. The main reasons for this are that the forward propagation requires both forward and backward recursions in bidirectional layers and that the backpropagation is dependent on the outcomes of the forward propagation. Hence, gradients will have a very long dependency chain.
In practice bidirectional layers are used very sparingly and only for a narrow set of applications, such as filling in missing words, annotating tokens (e.g., for named entity recognition), and encoding sequences wholesale as a step in a sequence processing pipeline (e.g., for machine translation).
In :numref:sec_bert and :numref:sec_sentiment_rnn, we will introduce how to use bidirectional RNNs to encode text sequences.
If we were to ignore all advice regarding the fact that bidirectional RNNs use past and future data and simply apply them to language models, we would get estimates with acceptable perplexity. Nonetheless, the ability of the model to predict future tokens is severely compromised, as the experiment below illustrates. Despite reasonable perplexity, it only generates gibberish even after many iterations. We include the code below as a cautionary example against using bidirectional RNNs in the wrong context.
```{.python .input}
from d2l import mxnet as d2l
from mxnet import npx
from mxnet.gluon import rnn

npx.set_np()

# Load data
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

# Define the bidirectional LSTM model by setting `bidirectional=True`
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
lstm_layer = rnn.LSTM(num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))

# Train the model
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
```{.python .input}
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn

# Load data
batch_size, num_steps, device = 32, 35, d2l.try_gpu()
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

# Define the bidirectional LSTM model by setting `bidirectional=True`
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers, bidirectional=True)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)

# Train the model
num_epochs, lr = 500, 1
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
The output is clearly unsatisfactory for the reasons described above.
For a discussion of more effective uses of bidirectional RNNs, please see the sentiment analysis application in :numref:sec_sentiment_rnn.
Summary

- In bidirectional RNNs, the hidden state for each time step is simultaneously determined by the data prior to and after the current time step.
- Bidirectional RNNs bear a striking resemblance to the forward-backward algorithm in probabilistic graphical models.
- Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations given bidirectional context.
- Bidirectional RNNs are very costly to train due to long gradient chains.
Exercises

- If the different directions use a different number of hidden units, how will the shape of $\mathbf{H}_t$ change?
- Design a bidirectional RNN with multiple hidden layers.
- Polysemy is common in natural languages. For example, the word "bank" has different meanings in the contexts "I went to the bank to deposit cash" and "I went to the bank to sit down". How can we design a neural network model such that, given a context sequence and a word, a vector representation of the word in that context is returned? What types of neural architectures are preferred for handling polysemy?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab: