Hi, welcome to the Advanced Object Detection blog post series on single-stage object detectors, notably the YOLO and SSD families. In our previous posts, we had a detailed look at YOLO and YOLOv2 (and YOLO9000 too). Now it’s time to introduce another state-of-the-art object detector: the Single Shot Detector (SSD). It has a few similarities to YOLOv2, namely dividing the image into grid cells and using anchor boxes for detection. The YOLOv2 paper appeared on arXiv in late 2016 and was presented at CVPR 2017 (Conference on Computer Vision and Pattern Recognition), while the SSD paper was published at ECCV 2016 (European Conference on Computer Vision). SSD achieved 74.3% mAP at 59 fps, which is better than both Faster R-CNN (73% mAP at 7 fps) and YOLOv1 (63.4% mAP at 45 fps). We will go through the architecture and loss function of SSD, and compare it with YOLOv1 at various places in this article. Let’s assume that YOLOv2 is not yet out, so that Faster R-CNN and YOLOv1 are the most recent state-of-the-art object detectors. Well, let’s dive into SSD.

SSD uses the famed VGG-16 model pretrained on ImageNet as the base network for feature extraction because it has been exceptional in image classification! The FC (fully connected) layers of this base network were removed and 6 auxiliary convolutional layers were added. These layers produce feature maps of progressively smaller size, which helps in detecting objects at multiple scales. Using multi-scale feature maps helps in detecting objects of all sizes and aspect ratios. The image below depicts the SSD300 architecture (input image size of 300×300), where Conv6, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 are the extra layers with feature maps at different scales. This can alternatively be seen as a pyramid representation of the image at different scales.

SSD Architecture. Image Credit – SSD Paper

One immediate observation is that the feature maps shrink progressively as we go deeper into the network. The layers at the deep end have a larger receptive field and construct more abstract representations, which helps in detecting larger objects. (The receptive field is the region of the input space that a particular CNN feature is looking at, i.e. is affected by; it can be described by its central location and its size.) The exact opposite happens in the initial layers, where the receptive field is quite small, which is useful for detecting smaller objects. This gives SSD a clear edge over YOLOv1 in detecting both smaller and larger objects (recall that YOLOv1 struggled to detect small objects).

This means that in SSD, detection happens at every pyramid layer added on top of the default VGG-16 network. Each of these layers has its own convolutional predictor, since every layer works on a feature map of a different scale. Predicting from feature maps of several sizes improves the chance that an object is detected whatever its size. YOLOv1 predicted from a single-scale feature map, which made it inaccurate on smaller objects.

The feature maps used for prediction are of sizes 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1. Coming to the mechanism of making predictions: for each cell of a feature map (also called a location), SSD makes several predictions, one per default box (4 in this example). Each prediction is composed of a bounding box (4 values) and 21 class scores (20 classes + 1 class for no object). SSD computes both the location and the class scores using small convolutional filters: 3×3 filters are applied to the feature map and output 25 channels per default box (21 class scores + 4 bounding box values). The image below shows the feature maps over an image and how objects of different scales are detected.

Lower and higher resolution feature mappings. Image Credits – medium.com

The authors associated a set of default boxes with each cell of the feature map, tiling the feature map in a convolutional manner. This means each default box is fixed relative to its corresponding cell (default boxes are nothing but anchor boxes). At each feature map cell, the predictions are the offsets relative to the default boxes, plus the per-class scores that indicate the confidence that an object of each class is present in that box. For each of the k default boxes at a location, there are c class scores and 4 offsets, which makes (c + 4)k filters per location and (c + 4)kmn outputs for an m×n feature map. So, effectively, we are getting multiple boxes for a single image, and voila, that’s the title justified!

Now, let’s calculate how many boxes we get. Conv4_3 (38×38 feature map, 4 boxes per cell): 38*38*4 = 5776 boxes; Conv7 (19×19 feature map, 6 boxes per cell): 19*19*6 = 2166 boxes; and so on we get Conv8_2: 600 boxes, Conv9_2: 150 boxes, Conv10_2: 36 boxes, and Conv11_2: 4 boxes. These add up to 8732 boxes, which is roughly 89 times the 98 boxes of YOLOv1. It’s time now to get into the loss function.
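The box arithmetic above can be checked with a few lines of Python. This is only a sketch: the names `FEATURE_MAPS`, `boxes_per_layer` and `predictor_channels` are made up for illustration, and the boxes-per-cell values (4 or 6) are the ones implied by the counts in the text.

```python
# (layer name, feature map side length, default boxes per cell), as listed in the text.
FEATURE_MAPS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]
NUM_CLASSES = 21  # 20 object classes + 1 "no object" class


def boxes_per_layer(feature_maps=FEATURE_MAPS):
    """Return {layer name: number of default boxes} for every prediction layer."""
    return {name: size * size * k for name, size, k in feature_maps}


def predictor_channels(k, num_classes=NUM_CLASSES):
    """Output channels of the predictor at one location: k * (c + 4)."""
    return k * (num_classes + 4)


counts = boxes_per_layer()
total_boxes = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n} boxes")
print("Total:", total_boxes)
```

With 4 default boxes per cell the predictor outputs 4 * 25 = 100 channels, and with 6 boxes it outputs 150.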

The loss function of SSD is derived from the MultiBox loss function. MultiBox was originally a method for fast class-agnostic bounding box coordinate proposals. In MultiBox, the authors created priors (anchors), which are pre-computed, fixed-size bounding boxes that closely match the ground truth boxes. A prior counts as a match for a ground truth box when their Jaccard overlap is above the threshold of 0.5. MultiBox starts with the priors as predictions and attempts to regress them closer to the ground truth bounding boxes. In SSD, the default bounding boxes are chosen manually, with a scale that increases linearly from s=0.2 at the lowest layer to s=0.9 at the highest layer. There are 5 aspect ratios for the boxes: 1, 2, 3, ½, and ⅓. YOLOv2, by contrast, uses a clustering algorithm to determine its anchor boxes (recall from our YOLOv2 post). We will revisit the selection of default boxes at a later point in this article. These default boxes are similar to the anchor boxes used in Faster R-CNN, the only difference being that in SSD they are applied to several feature maps of different resolutions. The image below shows the default boxes for an image.

The Green boxes are the default boxes that match at least one ground truth box. Black boxes are the default boxes which were not assigned a label at all. Image Credits – medium.com

The loss function is a combination of confidence loss and location loss. The confidence loss measures how confident the network is about the class of the object inside the bounding box, and is computed using categorical cross-entropy. The location loss deals with the bounding box offset values and is a Smooth L1 loss between the predicted box and the ground truth box.

The combined loss function for SSD: Single Shot MultiBox Detector. It is the weighted sum of the localization and confidence losses, where N is the number of matched default boxes, l is the predicted bounding box, and g is the ground truth box.
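For reference, the combined loss described in this caption can be written out as follows. This follows the SSD paper's notation, where x is the indicator for matching default boxes to ground truth boxes, c holds the class confidences, and α is the weight between the two terms (the paper sets α = 1 and defines the loss as 0 when N = 0).

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```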

Localization loss (Smooth L1 loss): the network regresses offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). The Smooth L1 loss, which reduces to the form given in the Fast R-CNN paper when beta = 1, is:

smoothl1(x) = 0.5 * x**2 / beta, if abs(x) < beta

smoothl1(x) = abs(x) - 0.5 * beta, otherwise
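Here is a minimal Python sketch of the Smooth L1 formula above, together with the offset encoding used in the SSD paper: center offsets are divided by the default box size, and width and height use a log ratio. The function names `smooth_l1`, `encode_box` and `localization_loss` are hypothetical, and boxes are assumed to be given as (cx, cy, w, h).

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * x * x / beta
    return ax - 0.5 * beta


def encode_box(gt, default):
    """Encode a ground truth box as regression targets relative to a default box.

    Both boxes are (cx, cy, w, h). Center offsets are normalised by the default
    box size; width and height are encoded as log ratios.
    """
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,
        (g_cy - d_cy) / d_h,
        math.log(g_w / d_w),
        math.log(g_h / d_h),
    )


def localization_loss(pred_offsets, gt, default, beta=1.0):
    """Smooth L1 loss summed over the 4 offsets of one matched default box."""
    target = encode_box(gt, default)
    return sum(smooth_l1(p - t, beta) for p, t in zip(pred_offsets, target))


default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.55, 0.5, 0.2, 0.4)
print(encode_box(gt_box, default_box))
print(localization_loss((0.0, 0.0, 0.0, 0.0), gt_box, default_box))
```

In the full loss, this per-box value is summed over all matched (positive) default boxes only. Negative boxes contribute nothing to the localization term.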

The confidence loss, given below, is a softmax loss over the confidences (c) of the multiple classes.

Image Credit – SSD Paper
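As a rough illustration of the confidence loss, the sketch below computes a softmax cross-entropy for a single default box in plain Python. It assumes class index 0 stands for "no object" (background), and the names `softmax` and `confidence_loss` are made up for this example. A real implementation would work on whole tensors of boxes at once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_loss(scores, target_class):
    """Cross-entropy for one default box.

    `target_class` is the matched object class for a positive box,
    or 0 (background, an assumption of this sketch) for a negative box.
    """
    probs = softmax(scores)
    return -math.log(probs[target_class])


# A positive box matched to class 2, and a negative (background) box.
positive = confidence_loss([0.1, 0.3, 2.5], target_class=2)
negative = confidence_loss([1.8, 0.2, 0.1], target_class=0)
print(positive + negative)
```

The total confidence loss is the sum of these terms over the positive boxes and the selected negative boxes.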

During the training of SSD, choosing the set of default boxes and the scales for detection is very important. The default boxes play a massive role within the SSD architecture, so let’s look at the scale values (remember, SSD is a multi-scale object detector) and the aspect ratios used for choosing them. To detect objects at various scales effectively, one approach would be to process the image at different sizes and combine the results afterwards. But using feature maps from various layers mimics the same effect in a single network pass. Specific feature maps learn to respond to specific scales of the object. The scale of the default boxes for the kth feature map is given by s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), where m is the number of feature maps used for prediction and k runs from 1 to m. With s_min = 0.2 and s_max = 0.9, the scale is 0.2 at the lowest layer and 0.9 at the highest layer. For each scale s_k, the 5 aspect ratios discussed earlier are {1, 2, 3, ½, ⅓}. We can have up to 6 bounding boxes per feature map location: these 5, plus 1 extra box with an aspect ratio of 1:1 at a slightly larger scale.
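The scale rule can be sketched in Python as below. The width and height formulas (w = s_k * sqrt(a), h = s_k / sqrt(a)) and the extra square box at scale sqrt(s_k * s_{k+1}) follow the SSD paper. The helper names `layer_scales` and `default_box_shapes` are hypothetical.

```python
import math


def layer_scales(m=6, s_min=0.2, s_max=0.9):
    """Linearly spaced scales s_1..s_m, relative to the input image size."""
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def default_box_shapes(s_k, s_next, aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return (width, height) pairs for one feature map location.

    One box per aspect ratio at scale s_k, plus one extra square box
    at the intermediate scale sqrt(s_k * s_next).
    """
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    s_extra = math.sqrt(s_k * s_next)
    shapes.append((s_extra, s_extra))
    return shapes


scales = layer_scales()
print([round(s, 2) for s in scales])

for w, h in default_box_shapes(scales[0], scales[1]):
    print(f"w={w:.3f}, h={h:.3f}")
```

Every box at a given aspect ratio keeps the same area s_k squared, so only the shape changes with the aspect ratio, not the size.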

By combining predictions for all default boxes with various scales and aspect ratios across the layers, the set of predictions is very diverse and tends to generalize really well to objects of multiple sizes and aspect ratios. For example, in the image given below, the dog is matched to a default box in the 4×4 feature map but not to any default box in the 8×8 feature map, because at that scale the default boxes do not match the dog’s box and are treated as negatives during training. The matching strategy used during training first matches each ground truth box to the default box with the best Jaccard overlap, and then also matches default boxes to any ground truth box with a Jaccard overlap higher than a threshold (0.5). More on Jaccard overlap in the references at the end of this post. This allows the network to predict high scores for multiple overlapping default boxes rather than pick only the one with the maximum overlap.
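A small sketch of the Jaccard overlap (IoU) and the two-step matching may help here. It assumes boxes in (xmin, ymin, xmax, ymax) form, and the function names `jaccard` and `match_default_boxes` are made up for illustration. Real implementations vectorise this over all 8732 boxes.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(gt_boxes, default_boxes, threshold=0.5):
    """Return, per default box, the index of its matched ground truth box or None."""
    matches = [None] * len(default_boxes)

    # Threshold matching: a default box is positive if its best overlap
    # with any ground truth box exceeds the threshold.
    for d, dbox in enumerate(default_boxes):
        best_iou, best_gt = 0.0, None
        for g, gbox in enumerate(gt_boxes):
            iou = jaccard(gbox, dbox)
            if iou > best_iou:
                best_iou, best_gt = iou, g
        if best_iou > threshold:
            matches[d] = best_gt

    # Best-overlap matching: every ground truth box keeps its single best
    # default box, even if that overlap is below the threshold.
    for g, gbox in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)),
                     key=lambda d: jaccard(gbox, default_boxes[d]))
        matches[best_d] = g

    return matches


gt = [(0.0, 0.0, 1.0, 1.0)]
defaults = [(0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 0.8), (2.0, 2.0, 3.0, 3.0)]
print(match_default_boxes(gt, defaults))
```

Default boxes left as `None` are the negatives that feed into the next step of training.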

Image Credit – SSD Paper

Most of the default boxes are negatives after the matching step, which creates a significant imbalance between positive and negative examples. To counter this, instead of using all the negative examples, they are sorted by confidence loss and only the ones with the highest loss are kept, so that the ratio of negatives to positives is at most 3:1. This is called hard negative mining and leads to faster optimization and more stable training. Adding data augmentation during training made the model more robust to various input sizes and shapes. The techniques used were random patch sampling (sampling an image patch with a minimum Jaccard overlap of 0.1, 0.3, 0.5, 0.7, or 0.9 with the objects, keeping the patch aspect ratio between ½ and 2) and horizontal flipping with a probability of 0.5. It was trained on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets. Time to see the results now.
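Hard negative mining can be sketched in a few lines of Python. The function name `hard_negative_mining` and its inputs are hypothetical: `conf_losses` holds the per-box confidence loss and `is_positive` marks which default boxes were matched to a ground truth box.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the sorted indices of the negative boxes kept for the loss.

    Negatives are ranked by confidence loss (highest first) and only the top
    `neg_pos_ratio * num_positives` are kept.
    """
    num_pos = sum(1 for pos in is_positive if pos)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * num_pos])


losses = [0.1, 2.0, 0.5, 3.0, 0.2, 0.9, 0.05, 1.5]
positives = [False, True, False, False, False, False, False, False]
kept = hard_negative_mining(losses, positives)
print(kept)
```

With one positive box, only the three negatives with the highest loss survive. The rest are ignored when computing the confidence loss.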

Coming to the results, SSD300 is the first real-time object detector with an mAP of more than 70%. Fast YOLO ran at 155 fps but wasn’t that accurate (52.7% mAP). SSD300 achieves 74.3% mAP at 59 fps. The authors found that the base VGG-16 network accounts for around 80% of the total computation, so using a faster base network could make SSD even quicker. SSD512 (input image size of 512×512) achieves a slightly higher mAP of 76.8% at a much lower speed of around 22 fps. Further improvements were made on top of this baseline, including better small object detection, a slightly different convolution technique, and a few more changes. I strongly recommend reading the research paper to see how these improvements were made. There are a few surprises for you in that paper!

We have now covered SSD comprehensively and compared it with YOLO here and there in the article. If you want a detailed comparison between them or any set of object detectors, put it down in the comments below. Until then, try implementing SSD if you can. In our next post, we will get into one more single-stage object detector, RetinaNet, which uses focal loss and a feature pyramid, before going back to YOLOv3 and YOLOv4. Share your thoughts on this article and put down any architecture you want reviewed in the comments below, whether image classification, object detection, or image segmentation networks, and I will come back with a detailed review.


Pranav Raikote


  1. SSD Paper: https://arxiv.org/pdf/1512.02325.pdf
  2. Multibox Paper: https://arxiv.org/abs/1412.1441
  3. Jaccard Overlap/Coefficient: https://stats.stackexchange.com/a/307597
  4. TensorFlow implementation: https://github.com/pierluigiferrari/ssd_keras
  5. PyTorch implementation: https://github.com/amdegroot/ssd.pytorch
