
Commit 38322a9

Author: Yongyao Jiang
Commit message: Added figure caption
1 parent 4e3bdbf commit 38322a9

File tree

1 file changed: +21 -5 lines changed


guide/14-deep-learning/how-ssd-works.ipynb

@@ -18,9 +18,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Image classification in computer vision takes an image and predicts the object in the image, while object detection not only predicts the objects but also finds their locations in terms of bounding boxes. For example, when we build a swimming pool classifier, we take an input image and predict its class, while an object detection model would also tell us the location of the pool.\n",
+"Image classification in computer vision takes an image and predicts the object in the image, while object detection not only predicts the objects but also finds their locations in terms of bounding boxes. For example, when we build a swimming pool classifier, we take an input image and predict whether it contains a pool, while an object detection model would also tell us the location of the pool.\n",
 "\n",
-"<img src=\"img/classVdetection.png\" height=\"500\" width=\"500\">"
+"<img src=\"img/classVdetection.png\" height=\"500\" width=\"500\">\n",
+"<center>Figure 1. Difference between classification and object detection</center>"
 ]
},
{
@@ -54,7 +55,8 @@
 "\n",
 "To solve these problems, we would have to try out different sizes/shapes of sliding windows, which is very computationally intensive, especially with a deep neural network. \n",
 "\n",
-"<img src=\"img/slidingwindow.gif\" height=\"500\" width=\"500\">"
+"<img src=\"img/slidingwindow.gif\" height=\"500\" width=\"500\">\n",
+"<center>Figure 2. Example of the sliding window approach</center>"
 ]
},
{
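The hunk above describes the sliding window approach. A minimal sketch of that idea, assuming NumPy and made-up image, window, and stride sizes (none of which appear in the notebook), just to show how many crops a single window shape already produces:

```python
import numpy as np

def sliding_windows(image, win=64, stride=32):
    """Yield (x, y, crop) for every position of one fixed-size window."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

image = np.zeros((256, 256, 3), dtype=np.uint8)   # placeholder image
crops = list(sliding_windows(image))
print(len(crops), "crops to classify, for just one window size/shape")
```

Each crop would then be passed through a classifier, and the whole loop repeated for every window size and aspect ratio, which is why the text calls this approach computationally intensive.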
@@ -85,7 +87,8 @@
 "SSD has two components: a __backbone__ model and an __SSD head__. The _backbone_ model is usually a pre-trained image classification network used as a feature extractor. This is typically a network like ResNet trained on ImageNet from which the final fully connected classification layer has been removed. We are thus left with a deep neural network that is able to extract semantic meaning from the input image while preserving the spatial structure of the image, albeit at a lower resolution. For ResNet34, the backbone results in 256 7x7 feature maps for an input image. We will explain what features and feature maps are later on. The _SSD head_ is just one or more convolutional layers added to this backbone, and the outputs are interpreted as the bounding boxes and classes of objects in the spatial locations of the final layers' activations. \n",
 "\n",
 "In the figure below, the first few layers (white boxes) form the backbone, while the last few layers (blue boxes) represent the SSD head.\n",
-"<img src=\"img/ssd.png\" height=\"700\" width=\"700\">"
+"<img src=\"https://cdn-images-1.medium.com/max/1000/1*GmJiirxTSuSVrh-r7gtJdA.png\">\n",
+"<center>Figure 3. Architecture of a convolutional neural network with an SSD detector [2]</center>"
 ]
},
{
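The backbone/head split described in the hunk above can be sketched as follows. This is not the notebook's own code; it assumes PyTorch and torchvision, and the head shown (a single 3x3 convolution) is a deliberately simplified stand-in for a real SSD head:

```python
import torch
import torchvision

# Backbone: an image classifier with its pooling and fully connected
# classification layers removed, leaving spatially structured feature maps.
resnet = torchvision.models.resnet34()                        # weights omitted for brevity
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

x = torch.randn(1, 3, 224, 224)                               # dummy input image
features = backbone(x)
print(features.shape)   # e.g. [1, 512, 7, 7]; the channel count depends on where the network is cut

# Hypothetical SSD head: one conv layer whose output channels are read as
# per-location class scores and box offsets for a few anchor boxes.
num_anchors, num_classes = 4, 21
head = torch.nn.Conv2d(features.shape[1], num_anchors * (num_classes + 4),
                       kernel_size=3, padding=1)
print(head(features).shape)   # [1, 100, 7, 7] -> predictions at each of the 7x7 locations
```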
@@ -104,6 +107,7 @@
 "Instead of using a sliding window, SSD divides the image using a grid and has each grid cell be responsible for detecting objects in that region of the image. Detecting objects simply means predicting the class and location of an object within that region. If no object is present, we consider it as the background class and its location is ignored. For instance, we could use a 4x4 grid in the example below. Each grid cell is able to output the position and shape of the object it contains.\n",
 "\n",
 "<img src=\"img/gridcell.png\" height=\"300\" width=\"300\">\n",
+"<center>Figure 4. Example of a 4x4 grid</center>\n",
 "\n",
 "Now you might be wondering what happens if there are multiple objects in one grid cell, or if we need to detect multiple objects of different shapes. This is where anchor boxes and receptive fields come into play."
 ]
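To make the "each grid cell is responsible for a region" idea concrete, here is a tiny, purely illustrative helper; the image size, grid size, and example point are all invented:

```python
def responsible_cell(cx, cy, img_size=224, grid=4):
    """Return the (column, row) of the grid cell containing point (cx, cy)."""
    cell = img_size / grid            # side length of one grid cell in pixels
    return int(cx // cell), int(cy // cell)

# An object whose box center falls at pixel (60, 200) belongs to cell (1, 3):
print(responsible_cell(60, 200))
```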
@@ -117,6 +121,8 @@
 "Each grid cell in SSD can be assigned multiple anchor/prior boxes. These anchor boxes are pre-defined and each one is responsible for a size and shape within a grid cell. For example, the person in the image below corresponds to the taller anchor box while the car corresponds to the wider box.\n",
 "\n",
 "<img src=\"img/anchorbox.png\" height=\"480\" width=\"480\">\n",
+"<center>Figure 5. Example of two anchor boxes</center>\n",
+"\n",
 "\n",
 "SSD uses a matching phase during training to match the appropriate anchor box with the bounding boxes of each ground truth object within an image. Essentially, the anchor box with the highest degree of overlap with an object is responsible for predicting that object’s class and its location. This property is used for training the network and for predicting the detected objects and their locations once the network has been trained. In practice, each anchor box is specified by an aspect ratio and a zoom level.\n",
 "\n",
@@ -125,6 +131,7 @@
 "Not all objects are square in shape. Some are longer and some are wider, by varying degrees. The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this. The ratios parameter can be used to specify the different aspect ratios of the anchor boxes associated with each grid cell at each zoom/scale level.\n",
 "\n",
 "<img src=\"img/aspect.png\" height=\"700\" width=\"700\">\n",
+"<center>Figure 6. Example of why we need to set the aspect ratio</center>\n",
 "\n",
 "\n",
 "#### Zoom level\n",
@@ -140,6 +147,7 @@
 "\n",
 "Receptive field is defined as __the region in the input space that a particular CNN’s feature is looking at (i.e. is affected by)__. We will use \"feature\" and \"activation\" interchangeably here and treat them as the linear combination (sometimes applying an activation function after that to increase non-linearity) of the previous layer at the corresponding location [3]. Because of the convolution operation, features at different layers represent different sizes of regions in the input image. As it goes deeper, the size represented by a feature gets larger. In the example below, we start with the bottom layer (5x5) and then apply a convolution that results in the middle layer (3x3), where one feature (green pixel) represents a 3x3 region of the input layer (bottom layer). We then apply the convolution to the middle layer and get the top layer (2x2), where each feature corresponds to a 7x7 region of the input image. These kinds of green and orange 2D arrays are also called __feature maps__, which refer to a set of features created by applying the same feature extractor at different locations of the input map in a sliding window fashion. Features in the same feature map have the same receptive field and look for the same pattern but at different locations. This creates the spatial invariance of a ConvNet.\n",
 "<img src=\"img/receptive1.png\" height=\"500\" width=\"500\">\n",
+"<center>Figure 7. Visualizing CNN feature maps and receptive field</center>\n",
 "\n",
 "Receptive field is the central premise of the SSD architecture as it enables us to detect objects at different scales and output a tighter bounding box. Why? As you might still remember, the ResNet34 backbone outputs 256 7x7 feature maps for an input image. If we specify a 4x4 grid, the simplest approach is just to apply a convolution to this feature map and convert it to 4x4. This approach can actually work to some extent and is exactly the idea of YOLO (You Only Look Once). The extra step taken by SSD is that it applies more convolutional layers to the backbone feature map and has each of these convolutional layers output object detection results. __As earlier layers bearing smaller receptive fields can represent smaller sized objects, predictions from earlier layers help in dealing with smaller sized objects__.\n",
 "\n",
@@ -168,8 +176,16 @@
 "## References\n",
 "- [1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: “You Only Look Once: Unified, Real-Time Object Detection”, 2015; <a href='https://arxiv.org/abs/1506.02640'>arXiv:1506.02640</a>.\n",
 "- [2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu: “SSD: Single Shot MultiBox Detector”, 2016; <a href='http://arxiv.org/abs/1512.02325'>arXiv:1512.02325</a>.\n",
-"- [3] Zeiler, Matthew D., and Rob Fergus. \"Visualizing and understanding convolutional networks.\" In European Conference on Computer Vision, pp. 818-833. Springer, Cham, 2014."
+"- [3] Zeiler, Matthew D., and Rob Fergus. \"Visualizing and understanding convolutional networks.\" In European Conference on Computer Vision, pp. 818-833. Springer, Cham, 2014.\n",
+"- [4] Dang Ha The Hien. A guide to receptive field arithmetic for Convolutional Neural Networks. https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {
