Hello, I’m back with an Advanced Object Detector article! If you have read through all our previous articles on Object Detection – R-CNN, Fast R-CNN, SPPnet, Faster R-CNN & Mask R-CNN – you are now past the amateur stage. Although Mask R-CNN is a tad advanced, we covered it as part of the R-CNN family of object detectors. Having learnt the building blocks of an object detector, we have a fair intuition of how the various parts are put together into a fully functional detector, and we are on the right track to mastering Object Detection. We progressed through the R-CNN, Fast R-CNN & Faster R-CNN architectures and saw how each evolution improved accuracy and reduced the inference time per image.

The primary goal of any object detector is to be usable in real-world applications, ideally at 30–60 fps (frames per second) throughput, considering the input will be a video stream most of the time. So far, Faster R-CNN & Mask R-CNN managed an inference throughput of only about 5 fps with a ResNet backbone, which is nowhere near the required 30 or 60 fps. The main reason is that all of them are Two-Stage Detectors – two or more networks arranged in a more or less sequential manner. Although shared-computation mechanisms quicken training and testing (as in Faster R-CNN), the second stage is still a bottleneck for throughput.

Enter the Single-Stage Object Detectors – object detection using a single network, with no separate branching for the classification & regression (bounding box) tasks. The most famous object detector in recent times, known for its high accuracy and throughput, is the YOLO architecture. YOLO stands for “You Only Look Once”. We will deep dive into YOLO in this blog post and understand how it initiated a new era in the Object Detection space, with its various variants making huge strides towards faster and more accurate detectors. YOLO runs at 45 fps with its initial architecture (which is pretty good, considering most cameras shoot video at 30 fps) and touches a monumental 155 fps with its smaller architecture, which is slightly less accurate than the original network. One more interesting fact: the YOLO paper was awarded OpenCV’s People’s Choice Award at CVPR (Conference on Computer Vision & Pattern Recognition) 2016.

Let’s start with understanding the flaws of Two-Stage Detectors, then see how they can be solved with a single CNN architecture, gradually diving into the YOLO architecture. A first intuition would be to think about how to combine the Region Proposal Network (RPN) and the Classifier + Regressor network.

Can we just use the features extracted by the CNN for both classification and bounding-box prediction? Yes, we can. The last convolution layer holds a rich set of feature maps that retain spatial information at a reduced resolution but with an increased depth – a massive tensor of the order 13x18x2048 in the case of the Inception CNN model. We can use 1×1 convolution filters to classify each grid cell into a class. In the YOLO approach, each image is divided into an S×S grid of equal cells (S = 7 in the original paper; later variants use finer grids), and this 1×1 filter can give us the classification result for every grid cell. From the same dense tensor, we can attach more Conv layers or Fully Connected layers for the bounding-box predictions. YOLO uses a single-scale activation map for both classes and bounding boxes. YOLO also outputs a confidence for each bounding box along with the class probabilities for every grid cell; later, we can sort them and choose the class with the best probability. One more thing: if we use multiple scales in the activation layers, the mAP can be increased – we will consider this at a later point in the Advanced Object Detection series.
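To make the 1×1-convolution idea concrete, here is a minimal numpy sketch (the 7×7 grid and 2048-channel depth are illustrative values, not taken from a specific network): a 1×1 convolution is simply an independent linear map over the channel axis at every spatial position, so each grid cell gets its own class scores from its own feature vector.

```python
import numpy as np

def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: a linear map over the channel axis
    at every spatial position, (H, W, C_in) -> (H, W, C_out)."""
    h, w, c_in = feature_map.shape
    c_out = weights.shape[1]
    flat = feature_map.reshape(-1, c_in)   # (H*W, C_in)
    out = flat @ weights + bias            # (H*W, C_out)
    return out.reshape(h, w, c_out)

# Illustrative shapes: a 7x7 grid of 2048-d features mapped to 20 class scores per cell
rng = np.random.default_rng(0)
features = rng.standard_normal((7, 7, 2048))
W = rng.standard_normal((2048, 20)) * 0.01
b = np.zeros(20)
class_scores = conv1x1(features, W, b)
print(class_scores.shape)  # (7, 7, 20)
```

Because the same weights are applied at every position, this is exactly the “per-grid-cell classifier” described above.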

With some understanding of the improvement opportunities and a few sneak peeks into YOLO, let’s now understand things in detail. First things first – the image and its size: the input shape is 448x448x3, quite big. The architecture consists of 24 convolutional layers followed by 2 fully connected layers. 1×1 convolutional layers are placed in alternating fashion to make sure the dimensions after convolution don’t grow too much. The network is pre-trained on ImageNet with image sizes of 224×224. The architecture outputs a tensor of shape 7x7x30. The 155 fps version, Fast YOLO, uses just 9 convolutional layers, which increases speed at the cost of accuracy in terms of mAP. And unlike the other object detectors we have seen till now, YOLO looks at an image only once, and the information is captured and maintained in such a way that we can use it for both the classification and localization tasks very easily. The image below illustrates the architecture of YOLO.

Image Credits – YOLO Research Paper

The YOLO Architecture
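A quick sanity check on the spatial arithmetic: going from a 448×448 input to a 7×7 output grid means the network reduces resolution by a cumulative factor of 64, i.e. six stride-2 stages. The sketch below assumes ‘same’ padding so only the strides shrink the feature map (the exact layer list in the real network differs, this is just the resolution bookkeeping):

```python
def spatial_size(input_size, strides):
    """Track the spatial resolution through a stack of layers,
    assuming 'same' padding so only the stride shrinks the map."""
    size = input_size
    for s in strides:
        size = size // s
    return size

# YOLO reduces 448 -> 7, i.e. a cumulative stride of 64 (six stride-2 stages)
print(spatial_size(448, [2, 2, 2, 2, 2, 2]))  # 7
```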

Now comes the question: how do we extract information from the tensor obtained at the end of a pass through the network? Each grid cell predicts B bounding boxes and a confidence score for each of those B boxes. On top of this, as mentioned earlier, each grid cell gives us the probabilities of the object belonging to each of the classes. The confidence score is the metric of how confident the model is that the box contains an object and how accurate the box is; the authors define it as Confidence = Pr(Object) × IoU(pred, truth). Each bounding box outputs 5 values – x, y, w, h, C. Here (x, y) denotes the centre of the bounding box relative to the bounds of the grid cell it lies within, while w and h are normalized by the width and height of the image; all four lie in the range [0, 1]. At test time, the confidence score should reflect the Intersection-over-Union (IoU) between the predicted box and the ground-truth box.
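The cell-relative encoding above can be decoded back to pixel coordinates with a few lines of arithmetic. This is a minimal sketch under the stated convention ((x, y) relative to the cell, (w, h) relative to the image); the function name and corner-format output are my own choices, not from the paper:

```python
def decode_box(pred, row, col, S, img_w, img_h):
    """Convert a YOLO-style box (x, y, w, h) to absolute pixel corners.
    (x, y) is the box centre relative to grid cell (row, col); (w, h)
    are relative to the whole image. All four inputs lie in [0, 1]."""
    x, y, w, h = pred
    cx = (col + x) / S * img_w   # absolute centre x
    cy = (row + y) / S * img_h   # absolute centre y
    bw = w * img_w
    bh = h * img_h
    return cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2  # x1, y1, x2, y2

# A box centred in cell (3, 3) of a 7x7 grid on a 448x448 image,
# spanning half the image in each dimension
print(decode_box((0.5, 0.5, 0.5, 0.5), 3, 3, 7, 448, 448))  # (112.0, 112.0, 336.0, 336.0)
```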

Coming to the classification aspect, each grid cell also outputs the conditional class probabilities P(C|O), where C is the class and O denotes that an object is present. Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B. Explaining the 7x7x30 tensor shape: this is for the PASCAL VOC dataset with S = 7, B = 2, and C = 20 (labelled classes) in the formula SxSx(B*5 + C). The image below shows how the architecture was designed to output the (B*5 + C) vector for each of the SxS cells. This vector is produced by the 2 fully connected layers.

Image Credit – Lilianweng.github.io

The first vertical green box holds the 2 boxes with 5 parameters each (there is a limit of B = 2 boxes per cell), and the second vertical box represents the 20 conditional class probabilities.
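Slicing the output tensor into these two parts, and combining box confidence with the per-cell class probabilities into class-specific scores, can be sketched as follows (random values stand in for a real network output):

```python
import numpy as np

S, B, C = 7, 2, 20
out = np.random.default_rng(1).random((S, S, B * 5 + C))  # mock network output

boxes = out[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = out[..., B * 5:]                # P(C|O), one set per grid cell

# Class-specific confidence for every box: P(C|O) * Pr(Object) * IoU,
# where the last two factors are folded into the predicted box confidence
confidence = boxes[..., 4]                                          # (S, S, B)
class_scores = class_probs[:, :, None, :] * confidence[..., None]   # (S, S, B, C)
print(boxes.shape, class_probs.shape, class_scores.shape)
```

Sorting `class_scores` and taking the best class per box is exactly the “sort them and choose the class with the best probability” step described earlier.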

Now let us quickly move on to how it was trained, and the loss function. The model was trained with LeakyReLU activation for all layers except the last, which uses a linear activation. The loss is a multi-part function combining the classification and localization losses, optimized as a sum-squared error (because it is simple and easy to optimize). When we combine the losses naively, an imbalance is created: many grid cells contain no object, which pushes their confidence scores towards zero and overpowers the gradients from the cells that do contain an object, potentially causing instability and divergence during training. To fix this problem, the authors increased the loss from bounding-box coordinate predictions and decreased the loss from the confidence scores of boxes that contain no object, using λcoord = 5 and λnoobj = 0.5. One more important detail: the sum-squared error is computed on the square roots of the width and height, so that small deviations in larger boxes matter less than small deviations in smaller boxes. Given below is the multi-part loss function for the YOLO architecture.

Image Credits – YOLO Research Paper

Loss Function – YOLO

Here, B: number of bounding boxes, S: grid size, C: a particular class, 𝟙ᵢ^obj denotes whether any object appears in cell i, and 𝟙ᵢⱼ^obj denotes that the j-th bounding-box predictor in cell i is responsible for the prediction. This loss function penalizes the classification error only if an object is present in the grid cell, and penalizes the bounding-box error only if that particular predictor is responsible for the ground-truth box. The predictor with the highest IoU with the ground-truth box is assigned responsibility for predicting that object. The network was trained for 135 epochs with a batch size of 64 and momentum of 0.9 on the PASCAL VOC 2007 dataset. Dropout and data augmentation were used to prevent overfitting. At test time, the Non-Maximum Suppression algorithm filters out lower-confidence boxes that overlap a higher-confidence box with an IoU above a threshold; this adds 2–3% to the mAP score. Interestingly, the architecture’s results showed more localization errors than classification errors – exactly the opposite of Fast R-CNN.
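The coordinate and confidence terms of the loss described above can be sketched in numpy. This is a simplified single-responsible-box version (the classification term and the per-box indicator bookkeeping are omitted for brevity), so treat the function name and tensor layout as illustrative assumptions rather than the paper’s exact implementation:

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss_terms(pred, target, obj_mask):
    """Sum-squared-error terms of the YOLO loss for one image.
    pred/target: (S, S, 5) arrays of (x, y, w, h, confidence) for the
    responsible box per cell; obj_mask: (S, S) boolean, True where the
    cell contains an object. Square roots of w and h damp the effect of
    small deviations in large boxes. Classification term omitted."""
    xy_err = np.sum(((pred[..., :2] - target[..., :2]) ** 2)[obj_mask])
    wh_err = np.sum(((np.sqrt(pred[..., 2:4]) - np.sqrt(target[..., 2:4])) ** 2)[obj_mask])
    conf_obj = np.sum(((pred[..., 4] - target[..., 4]) ** 2)[obj_mask])
    conf_noobj = np.sum(((pred[..., 4] - target[..., 4]) ** 2)[~obj_mask])
    return LAMBDA_COORD * (xy_err + wh_err) + conf_obj + LAMBDA_NOOBJ * conf_noobj

# Sanity check: identical prediction and target give zero loss
target = np.zeros((7, 7, 5)); target[..., 2:4] = 0.5
mask = np.zeros((7, 7), dtype=bool); mask[3, 3] = True
print(yolo_loss_terms(target, target, mask))  # 0.0
```

Note how λcoord and λnoobj rescale the coordinate and no-object terms exactly as described in the paragraph above.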

YOLO proposes just 98 bounding boxes per image, as opposed to ~2000 from Selective Search, with an mAP of ~63% at up to 45 fps! Fast YOLO, the smaller network, achieves an mAP of ~53% at 155 fps (phenomenal speed). Furthermore, YOLO trained with a VGG-16 backbone offers a higher mAP of ~66% but compromises on speed at 21 fps. When applied to the VOC 2012 dataset, the scores are lower than R-CNN’s. The network generalizes very well: although trained on natural images, it turns out to perform well on artwork and on objects from other domains too.
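For completeness, here is a compact numpy sketch of the Non-Maximum Suppression step mentioned earlier: keep the highest-scoring box, drop any remaining box overlapping it above an IoU threshold, and repeat. Boxes are in (x1, y1, x2, y2) corner format; the function names are my own:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return [int(k) for k in keep]

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] – the near-duplicate of box 0 is suppressed
```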

Of course, the architecture has its drawbacks – it trails other state-of-the-art systems in accuracy, and it struggles with detecting & localizing small objects, especially those appearing in groups. It also struggles to generalize to new aspect ratios. This is likely because the features are coarse due to the downsampling layers.

With YOLO began the new era of real-time object detectors! This architecture was fundamental in the space and acted as a strong base for further improvements. Moreover, it removed the dreaded Selective Search, which was a real bottleneck in our previous object detection architectures. I recommend reading the paper to understand more details of the various experiments and training hyperparameters. Can we improve the performance of YOLO? Yes we can! A newer YOLO9000 (YOLOv2) model was developed shortly after, which could recognize more than 9000 object categories and was faster and more accurate than YOLO – we will discuss it in our next post. Until then, try implementing your own custom YOLO, and do put down your thoughts on this article in the comments section.


Pranav Raikote


  1. YOLO Paper: https://arxiv.org/pdf/1506.02640.pdf
  2. Video Explanation: https://www.youtube.com/watch?v=9s_FpMpdYW8
  3. From Slide 50 onwards: https://cse.iitkgp.ac.in/~adas/courses/dl_spr2020/slides/10_Detection_Segmentation.pdf
  4. CVPR Presentation Video: https://youtu.be/NM6lrxy0bxs
  5. Original YOLO implementation details: https://pjreddie.com/darknet/yolo/
