In :numref:`chapter_anchor`, we generated multiple anchor boxes centered on each pixel of the input image. These anchor boxes are used to sample different regions of the input image. However, if anchor boxes are generated centered on every pixel of the image, we soon end up with too many anchor boxes to compute. For example, assume that the input image has a height of 561 pixels and a width of 728 pixels. If five anchor boxes of different shapes are generated centered on each pixel, over two million anchor boxes ($561 \times 728 \times 5$) need to be predicted and labeled on the input image.
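As a quick sanity check on this count, we can multiply the numbers out:

```python
# Five anchor box shapes centered on every pixel of a 561 x 728 image
561 * 728 * 5  # 2042040, i.e., over two million
```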
It is not difficult to reduce the number of anchor boxes. A straightforward way is to uniformly sample a small portion of pixels from the input image and generate anchor boxes centered only on the sampled pixels. In addition, we can generate anchor boxes of varied numbers and sizes at multiple scales. Notice that smaller objects have more possible positions on the image than larger ones. Here is a simple example: objects with shapes of $1 \times 1$, $1 \times 2$, and $2 \times 2$ have 4, 2, and 1 possible position(s), respectively, on an image with the shape $2 \times 2$. Therefore, when using smaller anchor boxes to detect smaller objects, we can sample more regions; when using larger anchor boxes to detect larger objects, we can sample fewer regions.
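To make the position counting concrete, here is a minimal sketch; the helper `num_positions` is our own name for illustration, not something used in this book:

```python
def num_positions(img_h, img_w, obj_h, obj_w):
    """Number of whole-pixel positions of an obj_h x obj_w object
    on an img_h x img_w image."""
    return (img_h - obj_h + 1) * (img_w - obj_w + 1)

# 1 x 1, 1 x 2, and 2 x 2 objects on a 2 x 2 image: 4, 2, and 1 position(s)
[num_positions(2, 2, 1, 1), num_positions(2, 2, 1, 2), num_positions(2, 2, 2, 2)]
```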
To demonstrate how to generate anchor boxes at multiple scales, let us first read an image. It has a height of 561 pixels and a width of 728 pixels.
%matplotlib inline
import d2l
from mxnet import contrib, image, nd

img = image.imread('../img/catdog.jpg')
h, w = img.shape[0:2]
h, w
In :numref:`chapter_conv_layer`, the 2D array output of a convolutional neural network (CNN) is called a feature map. We can determine the midpoints of anchor boxes uniformly sampled on any image by defining the shape of the feature map.

The function `display_anchors` is defined below. We are going to generate anchor boxes `anchors` centered on each unit (pixel) of the feature map `fmap`. Since the coordinates of the anchor boxes in `anchors` have been divided by the width and height of the feature map `fmap`, values between 0 and 1 can be used to represent the relative positions of anchor boxes on the feature map. Since the midpoints of the anchor boxes `anchors` overlap with all the units of the feature map `fmap`, the relative spatial positions of the midpoints of `anchors` on any image are uniformly distributed. Specifically, when the width and height of the feature map are set to `fmap_w` and `fmap_h` respectively, the function uniformly samples `fmap_h` rows and `fmap_w` columns of pixels and uses them as midpoints to generate anchor boxes with size `s` (we assume that the length of the list `s` is 1) and different aspect ratios (`ratios`).
def display_anchors(fmap_w, fmap_h, s):
    d2l.set_figsize((3.5, 2.5))
    # The values from the first two dimensions will not affect the output.
    # The feature map uses the NCHW layout, so its height comes before its width
    fmap = nd.zeros((1, 10, fmap_h, fmap_w))
    anchors = contrib.nd.MultiBoxPrior(fmap, sizes=s, ratios=[1, 2, 0.5])
    bbox_scale = nd.array((w, h, w, h))
    d2l.show_bboxes(d2l.plt.imshow(img.asnumpy()).axes,
                    anchors[0] * bbox_scale)
We will first focus on the detection of small objects. To make them easier to distinguish in the display, the anchor boxes with different midpoints here do not overlap. We assume that the size of the anchor boxes is 0.15 and that the height and width of the feature map are both 4. We can see that the midpoints of the anchor boxes in the 4 rows and 4 columns on the image are uniformly distributed.
display_anchors(fmap_w=4, fmap_h=4, s=[0.15])
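If you would rather verify this spacing numerically than visually, the following check (our own, not part of the book's code) recovers the normalized midpoints from the output of `MultiBoxPrior`, assuming its boxes are listed row by row with the boxes of each unit grouped together:

```python
fmap = nd.zeros((1, 10, 4, 4))
anchors = contrib.nd.MultiBoxPrior(fmap, sizes=[0.15], ratios=[1, 2, 0.5])
# (rows, columns, boxes per unit, corner coordinates)
boxes = anchors.reshape((4, 4, 3, 4))
# Midpoint of the first box of every unit: (xmin + xmax) / 2, (ymin + ymax) / 2
x_centers = (boxes[:, :, 0, 0] + boxes[:, :, 0, 2]) / 2
y_centers = (boxes[:, :, 0, 1] + boxes[:, :, 0, 3]) / 2
x_centers[0], y_centers[:, 0]  # both are 0.125, 0.375, 0.625, 0.875
```

Both sequences are evenly spaced at intervals of $1/4$, confirming the uniform sampling of 4 rows and 4 columns.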
We are going to reduce the height and width of the feature map by half and use larger anchor boxes to detect larger objects. When the size is set to 0.4, the regions of some anchor boxes will overlap.
display_anchors(fmap_w=2, fmap_h=2, s=[0.4])
Finally, we are going to reduce the height and width of the feature map by half again and increase the anchor box size to 0.8. Now the midpoint of the anchor box is the center of the image.
display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
Since we have generated anchor boxes of different sizes on multiple scales, we will use them to detect objects of various sizes at different scales. Now we are going to introduce a method based on convolutional neural networks (CNNs).

At a certain scale, suppose we generate $h \times w$ sets of anchor boxes with different midpoints based on $c_i$ feature maps with a shape of $h \times w$, with each set containing $a$ anchor boxes with the same midpoint. For example, at the first scale of the experiment, we generated 16 sets of anchor boxes with different midpoints based on 10 (the number of channels) feature maps with a shape of $4 \times 4$, with each set containing 3 anchor boxes. Next, each anchor box is labeled with a category and an offset based on the categories and positions of the ground-truth bounding boxes. At the current scale, the object detection model needs to predict the categories and offsets of the $h \times w$ sets of anchor boxes on the input image.

We assume that the $c_i$ feature maps are the intermediate output of a CNN computed on the input image. Since each feature map has $h \times w$ different spatial positions, each position consists of $c_i$ units. According to the definition of the receptive field in :numref:`chapter_conv_layer`, the $c_i$ units of the feature maps at the same spatial position have the same receptive field on the input image, so they represent the information of the input image within that receptive field. Therefore, we can transform the $c_i$ units at the same position into the categories and offsets of the $a$ anchor boxes generated with that position as their midpoint. In essence, we use the information of the input image within a certain receptive field to predict the categories and offsets of the anchor boxes close to that field on the input image.
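As a minimal sketch of this transformation (our own illustration rather than the book's code; the next section implements the real model), a $3 \times 3$ convolution with padding 1 maps the $c_i$ channels at every position of the feature map to $a(q + 1)$ category scores, for $q$ object classes plus background, while preserving the $h \times w$ layout:

```python
from mxnet import nd
from mxnet.gluon import nn

def cls_predictor(a, q):
    # a * (q + 1) output channels: one score per anchor box per category
    # (q object classes plus one background class) at every position
    return nn.Conv2D(a * (q + 1), kernel_size=3, padding=1)

pred = cls_predictor(a=3, q=1)
pred.initialize()
# c_i = 10 feature maps with a shape of 4 x 4
pred(nd.zeros((1, 10, 4, 4))).shape  # (1, 3 * (1 + 1), 4, 4)
```

Predicting the offsets works the same way, with $4a$ output channels (four offsets per anchor box) instead of $a(q + 1)$.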
When the feature maps of different layers have receptive fields of different sizes on the input image, they are used to detect objects of different sizes. For example, we can design a network in which the units of feature maps closer to the output layer have wider receptive fields, so that they can detect larger objects in the input image.
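For example, the simplified sketch below (our own; the next section uses a similar but deeper block) halves the height and width of a feature map, so each unit of its output sees a wider region of the input image:

```python
from mxnet import nd
from mxnet.gluon import nn

def down_sample_blk(num_channels):
    # The convolution keeps the spatial size; max-pooling halves it,
    # widening the receptive field of every output unit
    blk = nn.Sequential()
    blk.add(nn.Conv2D(num_channels, kernel_size=3, padding=1),
            nn.BatchNorm(), nn.Activation('relu'),
            nn.MaxPool2D(2))
    return blk

blk = down_sample_blk(10)
blk.initialize()
blk(nd.zeros((1, 3, 4, 4))).shape  # (1, 10, 2, 2)
```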
We will implement a multiscale object detection model in the following section.
- We can generate anchor boxes with different numbers and sizes on multiple scales to detect objects of different sizes.
- The shape of the feature map can be used to determine the midpoints of the anchor boxes that uniformly sample any image.
- We use the information of the input image within a certain receptive field to predict the categories and offsets of the anchor boxes close to that field on the image.
- Given an input image, assume the shape of the feature map computed by a CNN is $1 \times c_i \times h \times w$, where $c_i$, $h$, and $w$ are the number, height, and width of the feature maps. What methods can you think of to convert this variable into the anchor boxes' categories and offsets? What is the shape of the output?