added attention visuals and notebooks
jamescalam committed Apr 8, 2021
1 parent 5a1fd61 commit 8fd24ee
Showing 7 changed files with 36 additions and 58 deletions.
Binary file modified assets/images/dot_product_attention.fla
Binary file not shown.
Binary file modified assets/images/dot_product_attention.png
Binary file not shown.
23 changes: 23 additions & 0 deletions course/attention/00_summary.ipynb
@@ -1,5 +1,28 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Attention\n",
"\n",
"In this section we'll cover four different types of attention mechanisms:\n",
"\n",
"* Dot-product (encoder-decoder) attention\n",
"\n",
"* Self attention\n",
"\n",
"* Bidirectional attention\n",
"\n",
"* Multihead attention\n",
"\n",
"Each of these mechanisms lend well to our understanding of modern day transformer models, which typically use a combination of these mechanisms - for example BERT which uses the dot-product attention, adapted for encoder-encoder mappings using self-attention, which is modified to bidirectional attention - and this operation is performed several times due to multihead attention.\n",
"\n",
"![Visual showing the focus of each attention mechanism](../../assets/images/attention_overview.png)\n",
"\n",
"Each row in the visual above corresponds to dot-product (encoder-decoder), self, bidirectional, and multihead attention respectively."
]
},
{
"cell_type": "code",
"execution_count": null,
9 changes: 8 additions & 1 deletion course/attention/01_dot_product_attention.ipynb
@@ -6,7 +6,7 @@
"source": [
"# Dot-Product Attention\n",
"\n",
"The first attention mechanism we will focus on is dot-product attention. When we perform many NLP tasks we would typically convert a word into a vector (*word2vec*), with transformers we perform the same operation. These vectors allows us to represent meaning numerically (eg days of the week may be clustered together, or we can perform logical arithmetic on the vectors - *King - Man + Woman = Queen*).\n",
"The first attention mechanism we will focus on is dot-product (encoder-decoder) attention. When we perform many NLP tasks we would typically convert a word into a vector (*word2vec*), with transformers we perform the same operation. These vectors allows us to represent meaning numerically (eg days of the week may be clustered together, or we can perform logical arithmetic on the vectors - *King - Man + Woman = Queen*).\n",
"\n",
"Because of this, we would expect sentences with similar meaning to have a similar set of values. For example, in neural machine translation, the phrase *\"Hello, how are you?\"*, and the Italian equivalent *\"Ciao, come va?\"* should share a similar matrix representation.\n",
"\n",
@@ -187,6 +187,13 @@
"\n",
"Once we calculate the dot product, we apply a softmax function to convert the dot product alignment into probabilities. These are then multiplied by *V* to give us the attention tensor **z**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
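The notebook above stops at the conceptual description, so here is a minimal NumPy sketch of the operation it walks through: the dot product of Q and K produces alignment scores, a softmax turns them into probabilities, and multiplying by V gives the attention tensor z. The shapes and random inputs are illustrative placeholders, not values taken from the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(q, k, v):
    # alignment scores between every query position and every key position,
    # scaled by sqrt(d_k) so the softmax inputs stay well-behaved
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # alignment -> probabilities
    return weights @ v                  # weighted sum of values: the attention tensor z

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # e.g. 4 decoder positions, dimension 8
k = rng.normal(size=(6, 8))  # e.g. 6 encoder positions
v = rng.normal(size=(6, 8))
z = dot_product_attention(q, k, v)
print(z.shape)  # (4, 8): one attention output per query position
```

The 1/sqrt(d_k) scaling is the standard refinement from the transformer literature; drop it and you have plain dot-product attention as described in the notebook.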
45 changes: 0 additions & 45 deletions course/attention/02_causal_attention.ipynb

This file was deleted.

13 changes: 2 additions & 11 deletions course/attention/03_bidirectional_attention.ipynb
@@ -6,19 +6,10 @@
"source": [
"# Bi-directional Attention\n",
"\n",
"TK explain\n",
"We've explored both dot-product attention, and self-attention. Where dot-product compared two sequences, and causal attention compared previous tokens from the *same sequence*, bidirectional attention compares tokens from the *same sequence* in both directions, subsequent and previous. This is as simple as performing the exact same operation that we performed for *self-attention*, but excluding the masking operation - allowing each word to be mapped to every other word in the same sequence. So, we could call this *bi-directional **self** attention*. This is particularly useful for masked language modeling - and is used in BERT (**Bidirectional Encoder** Representations from Transformers) - bidirectional self-attention refers to the *bidirectional encoder*, or the *BE* of BERT.\n",
"\n",
"## From Scratch in Numpy\n",
"\n",
"TK work through example in Numpy"
"![Bidirectional Attention](../../assets/images/bidirectional_attention.png)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
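To make the "self-attention minus the mask" point concrete, here is a short NumPy sketch: the same scoring routine run once with a causal mask and once without. Using x directly in place of learned Q/K/V projections is a simplifying assumption to keep the sketch minimal.

```python
import numpy as np

def softmax(x, axis=-1):
    # same numerically-stable softmax as in the previous sketch
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, causal=False):
    # in a real model q, k, v come from learned projections of x;
    # here x stands in for all three to keep the example minimal
    scores = x @ x.T / np.sqrt(x.shape[-1])
    if causal:
        # mask out subsequent (future) tokens with -inf before the softmax
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores, axis=-1) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                 # 5 tokens, dimension 8
causal_z = self_attention(x, causal=True)   # each token sees only previous tokens
bidir_z = self_attention(x, causal=False)   # each token sees the whole sequence
```

The only difference between the two calls is the mask, which is exactly the point the notebook makes about bidirectional self-attention.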
4 changes: 3 additions & 1 deletion course/attention/04_multihead_attention.ipynb
@@ -6,7 +6,9 @@
"source": [
"# Multihead Attention\n",
"\n",
"TK explain\n",
"\n",
"\n",
"![Flow in multihead attention](../../assets/images/multihead_attention.png)\n",
"\n",
"## From Scratch in Numpy\n",
"\n",
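The multihead notebook is still a stub in this commit, so the following is only a sketch of the idea its title names: several attention heads computed independently and concatenated. The random projection matrices stand in for learned weights, and the final output projection a real implementation would apply is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_self_attention(x, n_heads, rng):
    d_model = x.shape[-1]
    d_head = d_model // n_heads  # each head works in a smaller subspace
    heads = []
    for _ in range(n_heads):
        # random projections stand in for each head's learned W_q, W_k, W_v
        wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        weights = softmax(q @ k.T / np.sqrt(d_head), axis=-1)
        heads.append(weights @ v)
    # concatenate the head outputs back to d_model columns
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))  # 5 tokens, model dimension 16
z = multihead_self_attention(x, n_heads=4, rng=rng)
print(z.shape)  # (5, 16)
```

Each head is free to attend to different relationships in the sequence, which is the motivation for performing the operation several times in parallel.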
