In our previous article, we took a detailed look at the YOLO architecture and why it is popular and performs so well: up to 45 fps with an mAP of more than 65%. Like any other architecture, it had flaws that needed fixing to break the 45 fps barrier and to improve on that mAP. YOLO v1 tended to get the localization wrong when objects appeared in unusual aspect ratios, and it failed to detect groups of small objects such as a flock of birds. Let’s see how YOLO v1 was improved into a better, faster, and stronger YOLO. YOLO v2 reaches an mAP of more than 78% on PASCAL VOC 2007, a large gain in accuracy over YOLO v1, though at a slightly slower 40 fps. At 67 fps it still achieves about 76% mAP on the same benchmark. We will also see why the paper is titled YOLO9000!

The first step toward fixing these drawbacks is Batch Normalization, which increased the mAP by a little over 2%. BatchNorm was applied after every convolutional layer, and its regularizing effect helps the network fit better. With BatchNorm in place, the dropout layers were removed completely without overfitting. This also speeds up training somewhat, since a network with dropout typically takes roughly 2 to 3 times longer to train than the same network without it. The next major jump in mAP came from a higher-resolution classifier. YOLO v1 trained its classification network at 224×224 and then switched directly to 448×448 for detection, so the network had to adapt to the new resolution while also learning to detect. YOLO v2 instead starts training at 224×224 and then fine-tunes the classification network at 448×448 before training for detection. This approach improved the mAP by almost 4%.

There were major changes to the architecture of YOLO. To start, the Fully Connected layers were removed in favor of a classic, well-tested concept: Anchor Boxes. Remember the Anchor Boxes and offset prediction from Faster R-CNN? The same concept is applied here. Instead of predicting box coordinates directly, the network predicts offsets relative to a set of predefined anchor boxes. With the FC layers gone, these predictions come straight out of the convolutional layers, which also removes the requirement of a fixed input image size.

The authors made a bold assumption: objects, especially large ones, tend to be in the center of an image. To have a single center cell responsible for such objects, the feature map needs an odd number of locations along each side. The input resolution was therefore reduced from 448×448 to 416×416 and one pooling layer was removed; since the convolutional layers downsample by a factor of 32, the output is a 13×13 feature map. With this setup, YOLOv2 predicts the offsets, a confidence score, and the class probabilities for each anchor box. The confidence score predicts the IoU (Intersection over Union) between the ground truth box and the predicted box, and the class predictions are conditional probabilities of each class given that an object is present.

Now the big question is “Why Anchor Boxes?” Anchors allow multiple objects, including objects of different aspect ratios, to be detected in a single grid cell. This overcomes a restriction of YOLO v1, where only one class of object could be detected per grid cell. Using anchor boxes caused a slight decrease in mAP, but recall rose by 7 points, from 81% to 88%, meaning the model finds a larger share of the objects that are actually present. The images below illustrate the use of anchor boxes.

Anchor boxes examples and 5 anchor boxes. Image Credits – jonathan-hui
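The grid arithmetic behind the 416×416 choice can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the stride of 32 is the downsampling factor mentioned above, and the anchor counts passed in are example values (5 as in the caption above, 9 as used by Faster R-CNN) rather than anything fixed by the network.

```python
def grid_size(input_size, stride=32):
    """Side length of the output feature map for a square input."""
    if input_size % stride != 0:
        raise ValueError("input size must be a multiple of the stride")
    return input_size // stride


def has_single_center_cell(grid):
    """An odd grid has exactly one cell in the middle of the image."""
    return grid % 2 == 1


def predictions_per_image(grid, boxes_per_cell):
    """Total number of boxes predicted for one image."""
    return grid * grid * boxes_per_cell


for size in (448, 416):
    g = grid_size(size)
    print(size, "->", f"{g}x{g}", "single center cell:", has_single_center_cell(g))

# YOLO v1 style: 7x7 grid with 2 boxes per cell.
print(predictions_per_image(7, 2))
# 13x13 grid with 5 or 9 anchors per cell (example values).
print(predictions_per_image(13, 5), predictions_per_image(13, 9))
```

A 448×448 input would give an even 14×14 grid with four cells sharing the center, which is why the input was shrunk to 416×416.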

YOLO v1 predicted only 98 boxes per image, but with anchor boxes the number rises to more than 1000. There was a problem with these anchor boxes, though: their dimensions were initially hand-picked. To overcome this, a simple k-means clustering algorithm is run on the bounding boxes of the training set to find good priors (anchor boxes) automatically. The value of k was set to 5, as it gave a good tradeoff between recall and model complexity. The image below shows the resulting clusters on both datasets, VOC 2007 and COCO.

Clustering box dimensions on VOC and COCO. Image Credits – YOLO9000 Research paper

We don’t want just any 5 centroids; we want the 5 anchor boxes that give the best IoU with the ground truth boxes. Standard Euclidean distance would penalize errors on large boxes more than on small ones, so the distance metric used is d(box, centroid) = 1 - IoU(box, centroid), which gave the best results. Adding the Dimension Clusters, together with predicting the box location directly relative to the grid cell, improved the mAP by almost 5%. So these were the improvements regarding the Bounding Boxes, and we already saw 2 improvements with respect to training: Batch Normalization and the change in the input resolution. Let’s see 2 more such improvements in the training aspect of YOLO.
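To make the clustering step concrete, here is a minimal pure-Python sketch of k-means with the 1 - IoU distance. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, boxes are compared as if they shared the same corner (only width and height matter), centroids are updated with the plain mean of their cluster, and the box sizes are synthetic random values standing in for real ground truth boxes.

```python
import random


def iou_wh(box, centroid):
    """IoU of two (width, height) boxes aligned at the same corner."""
    w1, h1 = box
    w2, h2 = centroid
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs using d = 1 - IoU as the distance."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # start from k randomly chosen boxes
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = min(range(k), key=lambda j: 1.0 - iou_wh(box, centroids[j]))
            clusters[nearest].append(box)
        updated = []
        for j, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments stopped changing
        centroids = updated
    return centroids


# Synthetic box sizes as fractions of the image, only for demonstration.
data_rng = random.Random(1)
boxes = [(data_rng.uniform(0.05, 0.9), data_rng.uniform(0.05, 0.9)) for _ in range(200)]
anchors = kmeans_anchors(boxes, k=5)
avg_iou = sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)
print(anchors)
print("average best IoU:", avg_iou)
```

The average best IoU printed at the end is the kind of score one would compare across different values of k when weighing recall against model complexity.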

YOLO v1 struggles to detect small objects because fine-grained spatial features are lost. As we know, convolutional and pooling layers gradually reduce the spatial dimensions, and as the resolution shrinks, smaller objects get lost. The authors addressed this with a passthrough layer: the earlier 26x26x512 feature map is reshaped to 13x13x2048 by stacking adjacent spatial locations into channels, and is then concatenated with the final 13x13x1024 feature map. Convolutions are applied to this larger combined feature map for detection. This gives the network access to fine-grained features and increases the mAP by 1%.

Another opportunity comes from the absence of FC (Fully Connected) layers, which means the network can take input images of different sizes. During training, every 10 batches a new image size is chosen at random from the multiples of 32, because the model downsamples by a factor of 32. The options therefore run 320, 352, … 608. This helps the network generalize across resolutions, and it gives us multiple operating points, each with its own frame rate. The highest mAP is 78.6 @ 40fps (544×544 image size) and the lowest is around 69% @ 91fps (288×288 image size). The graph below compares the mAP and FPS of YOLOv2 at these sizes with other State-of-the-Art object detection models on the VOC 2007 dataset.
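The passthrough reshape described above can be sketched with NumPy as a space-to-depth operation. This is an assumption-laden illustration: the function name is hypothetical, the feature maps are filled with zeros, and the exact ordering of channels inside the stacked map may differ from what the Darknet implementation does.

```python
import numpy as np


def space_to_depth(x, block=2):
    """Move each block x block spatial neighbourhood into the channel axis.

    x has shape (height, width, channels); the result has shape
    (height / block, width / block, channels * block * block).
    """
    h, w, c = x.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two block axes together
    return x.reshape(h // block, w // block, block * block * c)


fine = np.zeros((26, 26, 512), dtype=np.float32)     # earlier, higher-resolution map
coarse = np.zeros((13, 13, 1024), dtype=np.float32)  # final, lower-resolution map

stacked = space_to_depth(fine)
merged = np.concatenate([stacked, coarse], axis=-1)
print(stacked.shape, merged.shape)
```

No values are discarded by the reshape: 26 × 26 × 512 and 13 × 13 × 2048 contain the same number of elements, so the fine-grained information is only rearranged.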

Accuracy and FPS of various Object Detectors. Image Credits – YOLO9000 Research paper
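The different YOLOv2 operating points come from running the same network at different input sizes, which the multi-scale training schedule prepares it for. A minimal sketch of that schedule follows; the function names and the fixed seed are assumptions for illustration, while the 320 to 608 range, the step of 32, and the switch every 10 batches come from the description above.

```python
import random


def multiscale_sizes(min_size=320, max_size=608, stride=32):
    """All allowed square input sizes, each a multiple of the stride."""
    return list(range(min_size, max_size + 1, stride))


def size_schedule(num_batches, every=10, seed=0):
    """Input size used for each batch, re-drawn at random every `every` batches."""
    rng = random.Random(seed)
    sizes = multiscale_sizes()
    schedule = []
    current = None
    for batch in range(num_batches):
        if batch % every == 0:
            current = rng.choice(sizes)
        schedule.append(current)
    return schedule


SIZES = multiscale_sizes()
schedule = size_schedule(num_batches=30)
print(SIZES)
print(schedule)
```

In a real training loop the chosen size would be used to resize each batch of images (and their boxes) before the forward pass.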

Wow! That is a big improvement over YOLO in many aspects, but we are not done yet. Accuracy improved substantially, but speed depends heavily on the base network, and that was still expensive. The majority of object detectors used VGG-16 as the base CNN, which requires 30.69 billion operations for a single pass over a 224×224 image. YOLO uses a custom network based on the GoogLeNet architecture that needs about 8.52 billion operations for a single pass, far fewer than VGG-16, but this comes at the cost of accuracy. YOLO’s top-5 accuracy (the fraction of images for which the correct label is among the model’s 5 highest-probability predictions) was around 88.0%, while VGG-16 reached 90%. The authors introduced a new network, Darknet-19, which resembles VGG-16 in using mostly 3×3 filters and doubling the number of channels after every pooling step. It also places 1×1 convolutions between the 3×3 convolutions to compress the features. It requires around 5.58 billion operations per image and achieved a top-5 accuracy of 91.2%. When fine-tuned at the higher resolution of 448×448, this network achieved a top-5 accuracy of 93.3%. For more details on the training experiments and procedure, I recommend reading the paper. The image below illustrates the Darknet-19 CNN architecture.

Darknet-19 Architecture. Image Credits – YOLO9000 Research paper
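As a rough sanity check on the operation count, the layer list can be written down and the multiply-adds summed. Treat this as a sketch under assumptions: the layer list is my transcription of the architecture table in the YOLO9000 paper, every convolution is assumed to preserve the spatial size, each max-pool halves it, a multiply-add is counted as two operations, and biases, batch normalization, and the final average pooling are ignored.

```python
# Each entry is ("conv", output_channels, kernel_size) or ("pool",).
DARKNET19 = [
    ("conv", 32, 3), ("pool",),
    ("conv", 64, 3), ("pool",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("pool",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("pool",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("pool",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),  # 1x1 convolution producing the 1000 class scores
]


def count_operations(layers, input_size=224, in_channels=3):
    """Approximate operation count for one forward pass (2 ops per multiply-add)."""
    size, channels, multiply_adds = input_size, in_channels, 0
    for layer in layers:
        if layer[0] == "pool":
            size //= 2  # 2x2 max-pool with stride 2
        else:
            _, out_channels, kernel = layer
            multiply_adds += size * size * channels * out_channels * kernel * kernel
            channels = out_channels
    return 2 * multiply_adds


num_convs = sum(1 for layer in DARKNET19 if layer[0] == "conv")
ops = count_operations(DARKNET19)
print(num_convs, "convolutional layers,", round(ops / 1e9, 2), "billion operations")
```

Under these assumptions the total lands at roughly 5.58 billion, in line with the figure quoted for Darknet-19, and the 19 convolutional layers explain the name.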

As per the title of the paper, “YOLO9000: Better, Faster, Stronger”, we have seen how it is better and faster than YOLO v1, but not yet how it became stronger or why the number 9000. The authors proposed a joint training mechanism that mixes classification and detection datasets. When the network sees an image labeled for detection, it backpropagates through the full YOLOv2 loss function; when it sees an image labeled only for classification, it backpropagates only the classification part of the loss. The YOLOv2 loss function is composed of three parts: the 4 coordinates (x, y, w, h), the probability P(obj) that an object exists in the bounding box, and the conditional probability Ci = P(obj belongs to i-th class | object exists in this box). The three terms are as follows:

Loss for x, y, w and h
The formula for  P(obj) – First term is loss for responsible bounding boxes and the second term for non-responsible ones
Loss for Conditional Probabilities Ci

This Loss Function was taken from a Tensorflow implementation of YOLO v2 (see the references at the end); the paper itself does not spell out the loss function mathematically. Coming back to the joint training mechanism, mixing classification and detection datasets initially posed a major problem: the labels do not line up. Detection datasets use general labels such as dog, car, and tree, whereas ImageNet has a variety of dog breeds. A single softmax over all labels would assume the classes are mutually exclusive, which “dog” and “Norfolk terrier” are not. One way to solve this is a multi-label model that does not assume mutual exclusion. The authors went further and made it a hierarchical classification using a structure called WordTree. Rather than explaining WordTree, which can be quite complex, let me show the representation.

Image Credits – YOLO9000 Research paper

As we can observe, it is a very clever way to use the WordNet directed graph structure to build a hierarchical tree that relates classes and subclasses. Combining the classes from the full ImageNet release with MS COCO gives a WordTree with a total of 9418 classes! Now we finally understand that the architecture got its name, YOLO9000, from the number of classes. Let’s see an example of how the class probability is calculated for a Norfolk terrier, which is a dog.

Image Credits – YOLO9000 Research paper
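The calculation multiplies conditional probabilities along the path from the root of the tree down to the class. The sketch below illustrates this with made-up numbers: the shortened path, the probability values, the 0.5 threshold, and the function names are all assumptions for illustration, and the real WordTree has more intermediate nodes.

```python
ROOT = "physical object"

# Hypothetical, shortened path: child -> parent.
PARENT = {
    "animal": "physical object",
    "mammal": "animal",
    "dog": "mammal",
    "hunting dog": "dog",
    "terrier": "hunting dog",
    "Norfolk terrier": "terrier",
}

# Hypothetical network outputs: P(node | parent of node).
COND_PROB = {
    "animal": 0.99,
    "mammal": 0.98,
    "dog": 0.95,
    "hunting dog": 0.50,
    "terrier": 0.50,
    "Norfolk terrier": 0.40,
}


def path_from_root(node):
    """Nodes from just below the root down to `node`."""
    path = []
    while node != ROOT:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def absolute_probability(node):
    """Product of the conditional probabilities along the path to `node`."""
    prob = 1.0
    for step in path_from_root(node):
        prob *= COND_PROB[step]
    return prob


def most_specific_label(leaf, threshold=0.5):
    """Walk down towards `leaf`, stopping before the probability drops below threshold."""
    label, prob = ROOT, 1.0
    for step in path_from_root(leaf):
        prob *= COND_PROB[step]
        if prob < threshold:
            break
        label = step
    return label


print(absolute_probability("dog"))
print(absolute_probability("Norfolk terrier"))
print(most_specific_label("Norfolk terrier"))
```

With these example numbers the model is confident about "dog" but not about the breed, so the walk stops at "dog" instead of committing to a low-probability leaf.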

The authors trained the Darknet-19 model on a WordTree built from just the 1000 ImageNet classes, with the intermediate nodes added to the label space. It achieved a top-5 accuracy of 90.4%, which is really good given that the network now predicts tree-structured labels. It had one major benefit: suppose the object was a Norfolk terrier; even if the network was not sure of the breed, it still said “dog” with very high confidence. The image below depicts the difference between the softmax over just the 1000 ImageNet classes and the softmax over the WordTree built from those classes. One point to note from the image is that under WordTree the softmax is applied multiple times, once over each set of co-hyponyms, i.e. sibling classes that share the same parent. For example, wave, snow, and cloud are co-hyponyms.

Image Credits – YOLO9000 Research paper
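The idea of applying a separate softmax to each set of co-hyponyms can be sketched in plain Python. The function name, the scores, and the second sibling group are assumptions made up for illustration; only the wave, snow, and cloud grouping comes from the text above.

```python
import math


def grouped_softmax(logits, groups):
    """Apply an independent softmax within each group of sibling indices."""
    probs = [0.0] * len(logits)
    for group in groups:
        peak = max(logits[i] for i in group)  # subtract the max for numerical stability
        exps = {i: math.exp(logits[i] - peak) for i in group}
        total = sum(exps.values())
        for i in group:
            probs[i] = exps[i] / total
    return probs


labels = ["wave", "snow", "cloud",
          "Norfolk terrier", "Yorkshire terrier", "Bedlington terrier"]
groups = [[0, 1, 2], [3, 4, 5]]           # one list of indices per sibling set
logits = [2.0, 0.5, 1.0, 3.0, 1.0, 0.0]   # hypothetical raw network scores

probs = grouped_softmax(logits, groups)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```

Each group sums to 1 on its own, so every output is a conditional probability given the shared parent rather than a share of one flat distribution over all classes.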

The loss functions are used to correct errors in the same way as in YOLOv2, but there is a small change in the anchor boxes: only 3 are used instead of 5. YOLO9000 achieved an mAP of 19.7% on the ImageNet detection validation set. Credit to the authors for attempting to build a real-time object detector for more than 9000 classes!

Here we are at the end of this article, which was pretty lengthy but worth the time, as we understood YOLOv2 and YOLO9000 in depth. YOLOv2 was state of the art for real-time object detection across a variety of datasets when it was published, and it can run on images of different sizes too. It overhauled and beat YOLOv1 in all aspects. YOLOv2 is also a stepping stone in bridging the gap between Object Classification and Detection. I request my readers to go through the YOLO9000 paper for more details on the training experiments, and I urge everyone to try building this and training it on a dataset of your choice, starting from a pre-trained YOLOv2. In our next post we will step into a slightly parallel universe: Single Shot Detectors (SSD), which are single-stage object detectors just like the YOLO family. If we look at the facts, SSD is also right up there in terms of fps and mAP, so it is well worth the time to understand one more State-of-the-Art object detector. Until then, try implementing YOLOv2 and YOLO9000, and think about whether you can propose some novel changes and whether they improve the performance. Put down your thoughts on this post in the comments below.


Pranav Raikote


  1. YOLO9000 Paper :
  2. Presentation Slides :
  3. YOLO v2 Tensorflow implementation :
  4. One more YOLOv2 implementation :
