Hello and welcome back to another article in the Advanced Object Detection series! In our last post, we stepped away from the YOLO detectors and looked at the RetinaNet architecture, which introduced a novel loss function called Focal Loss (and α-balanced Focal Loss) to tackle the severe class imbalance observed in single-stage object detectors. Now let's come back to the YOLO object detectors, specifically YOLOv3. YOLOv3 made some minor updates on top of YOLOv2 that made it better and stronger, but sadly not as fast: the authors traded speed for accuracy. It matches the accuracy of SSD while running 3x faster, at ~22 ms inference time, and larger inputs (416×416) push inference to just under 30 ms per image. It even comes close to RetinaNet in accuracy while being far faster. Let's dig deep and understand the improvements made over YOLOv2 and why the result is slower but more accurate.

What could explain higher accuracy and slower inference? The first guess is a bigger network architecture, and that's spot on! YOLOv3 has a hybrid design that takes YOLOv2's Darknet-19 and adds residual blocks. The new architecture uses 3×3 and 1×1 Conv layers in succession with residual connections, which gives a bigger and stronger network in the form of Darknet-53 (53 convolutional layers). YOLOv2's Darknet-19 had just 19 layers, so a drop in speed is expected, and the accuracy improvement is also predictable, since residual connections are known to improve the performance of CNNs in general. Darknet-53 matches ResNet-152's 93.8% top-5 accuracy while holding up quite well at 78 FPS, about 2 times faster than ResNet-152's 37 FPS. The 53-layer setup provides very good accuracy and is faster than the 101- and 152-layer networks.

An additional, more subtle improvement is the skip-layer concatenation borrowed from DenseNet: feature maps from two layers are concatenated into a higher-depth map, instead of being summed element-wise as in ResNets. This may be a key factor behind the increased accuracy, and the dense, feature-rich maps also help in detecting smaller objects. The image below illustrates skip connections via addition and concatenation.
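To make the difference between the two kinds of skip connection concrete, here is a minimal sketch in plain Python. It is not Darknet code: the feature maps are modelled as nested lists, and the helper names `add_skip`, `concat_skip`, and `constant_map` are made up for this illustration. The point is only that addition keeps the channel count, while concatenation grows it.

```python
# Feature maps are modelled as lists of channels; each channel is a 2D list (H x W).
# All helper names here are hypothetical, chosen for illustration only.

def add_skip(x, fx):
    """ResNet-style skip: element-wise sum, so both inputs need the same channel count."""
    if len(x) != len(fx):
        raise ValueError("addition needs matching channel counts")
    return [
        [[a + b for a, b in zip(row_x, row_f)] for row_x, row_f in zip(ch_x, ch_f)]
        for ch_x, ch_f in zip(x, fx)
    ]

def concat_skip(x, fx):
    """DenseNet-style skip: stack channels, so the depth becomes len(x) + len(fx)."""
    return x + fx

def constant_map(channels, height, width, value):
    """Build a toy feature map filled with a single value."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

earlier = constant_map(2, 2, 2, 1.0)   # 2 channels of ones
current = constant_map(2, 2, 2, 0.5)   # 2 channels of 0.5

summed = add_skip(earlier, current)
stacked = concat_skip(earlier, current)

print(len(summed), summed[0][0][0])    # 2 1.5  (same depth, values added)
print(len(stacked))                    # 4      (depth doubled, values untouched)
```

With addition, the earlier features are blended into the current ones. With concatenation, later layers see both sets of features side by side and can learn how to combine them.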

YOLOv3 uses multiple independent logistic classifiers instead of softmax classification. This change was introduced because a single box may carry more than one label at once (for example, Woman and Person), which makes this a multi-label classification problem. A binary cross-entropy loss function is used for the logistic classifiers. Overlapping labels that are not mutually exclusive are handled robustly by this change.
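A small sketch can show why independent logistic classifiers suit multi-label boxes better than softmax. The class names, logits, and targets below are invented for illustration, and the loss is a plain mean binary cross-entropy rather than the exact loss code used in any YOLOv3 implementation.

```python
import math

def softmax(logits):
    """Softmax forces the class probabilities to compete and sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function: each class gets its own independent probability."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over independent per-class probabilities."""
    loss = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss / len(probs)

class_names = ["person", "woman", "car"]  # hypothetical overlapping labels
logits = [3.0, 2.5, -4.0]                 # made-up raw scores for one box
targets = [1, 1, 0]                       # the box is both a person and a woman

soft = softmax(logits)
indep = [sigmoid(z) for z in logits]

for name, s, i in zip(class_names, soft, indep):
    print(f"{name:>6}: softmax={s:.3f}  sigmoid={i:.3f}")
print("BCE loss:", round(binary_cross_entropy(indep, targets), 4))
```

With softmax, "person" and "woman" compete for the same probability mass, so only one can score above 0.5. With sigmoids, both can be confidently present at once.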

The next improvement concerns the bounding box predictions. Whereas YOLOv2 used a linear regression for the objectness of each bounding box, YOLOv3 employs logistic regression for this task. YOLOv2 also used a sum of squared errors for classification. The x, y offsets are predicted through a logistic activation as well; the authors report that using a linear activation for the offset predictions led to a decrease in mAP.
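As a rough sketch of what "logistic" means here, the snippet below decodes one raw prediction into a box. It follows the decoding equations given in the YOLOv2 and YOLOv3 papers (sigmoid on the centre offsets and objectness, exponential on the anchor dimensions). The function name, argument names, and sample values are assumptions made for this example, and everything is expressed in grid-cell units.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_prediction(t_x, t_y, t_w, t_h, t_obj, cell_x, cell_y, anchor_w, anchor_h):
    """Turn raw network outputs into a box centre, size, and objectness score.

    cell_x, cell_y: top-left corner of the grid cell (in grid units).
    anchor_w, anchor_h: anchor (prior) box size (in grid units).
    """
    b_x = cell_x + sigmoid(t_x)        # sigmoid keeps the centre inside its cell
    b_y = cell_y + sigmoid(t_y)
    b_w = anchor_w * math.exp(t_w)     # size is a scaling of the anchor
    b_h = anchor_h * math.exp(t_h)
    objectness = sigmoid(t_obj)        # logistic regression for objectness
    return b_x, b_y, b_w, b_h, objectness

# All-zero raw outputs land in the middle of cell (6, 4) with the anchor's own size.
print(decode_prediction(0.0, 0.0, 0.0, 0.0, 0.0, 6, 4, 3.0, 2.0))
# (6.5, 4.5, 3.0, 2.0, 0.5)
```

Because the sigmoid output lies between 0 and 1, a predicted centre can never leave the cell that is responsible for it. A linear activation would not give that constraint.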

Next up is making YOLOv3 robust to objects at different scales. YOLOv3 predicts at 3 different scales. Convolutional layers are added on top of the base network in a structure similar to a feature pyramid. The last layer at each scale predicts a 3-dimensional tensor that holds the bounding box, objectness, and class prediction information. When trained on the MS COCO dataset, 3 boxes are predicted per grid cell at each scale, so the tensor for one cell is 1 × 1 × [B × (5 + C)], where B is the number of bounding boxes per cell of the feature map, and 5 + C stands for 4 bounding box coordinates + 1 objectness score + C class scores. With B = 3 and C = 80, this gives a tensor of dimension 1×1×255. The image below depicts how detection happens and also illustrates the tensor.
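The arithmetic behind the 255 channels can be checked in a few lines. In this sketch, the helper names and the per-box layout (4 box values, then objectness, then class scores, repeated for each box) are assumptions for illustration; a given implementation may order the values differently.

```python
def head_depth(boxes_per_cell, num_classes):
    """Channels per grid cell: B * (4 box coords + 1 objectness + C classes)."""
    return boxes_per_cell * (4 + 1 + num_classes)

def split_cell_vector(vector, boxes_per_cell, num_classes):
    """Split one cell's prediction vector into per-box pieces (assumed layout)."""
    per_box = 5 + num_classes
    if len(vector) != boxes_per_cell * per_box:
        raise ValueError("vector length does not match B * (5 + C)")
    boxes = []
    for b in range(boxes_per_cell):
        chunk = vector[b * per_box:(b + 1) * per_box]
        boxes.append({"box": chunk[:4], "objectness": chunk[4], "classes": chunk[5:]})
    return boxes

depth = head_depth(boxes_per_cell=3, num_classes=80)
print(depth)  # 255

cell = [0.0] * depth  # dummy prediction vector for a single grid cell
parts = split_cell_vector(cell, boxes_per_cell=3, num_classes=80)
print(len(parts), len(parts[0]["box"]), len(parts[0]["classes"]))  # 3 4 80
```

Changing `num_classes` to the size of your own dataset shows how the depth of the detection layer scales. For example, 20 classes would give 3 × 25 = 75 channels.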

The three scales used for prediction are feature maps downsampled by factors of 32, 16, and 8 from the original 416×416 input. The first scale yields a feature map of size 13×13×255, and the second and third yield 26×26×255 and 52×52×255 respectively. The detections are made using a few 1×1 convolutional layers, which fuse the information from the previous layers and help with dimensionality reduction. The 13×13 layer is responsible for detecting larger objects and the 52×52 layer for smaller objects, while the 26×26 layer is quite robust at detecting medium to large-sized objects.

The next design element is the selection of anchor boxes. A total of 9 anchor boxes are selected, 3 for each scale. A clustering algorithm, K-Means, is used to select the anchors. The anchors are assigned in decreasing order of size: the biggest 3 go to the coarsest scale (13×13), the next 3 to the medium scale (26×26), and the smallest 3 to the finest scale (52×52). For a 416×416 image, a total of 10,647 bounding boxes are predicted per image, more than 10 times as many as YOLOv2, which contributes to its slower inference time.
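The grid sizes and the 10,647 figure follow directly from the strides, as this short sketch shows. The anchor sizes are the COCO anchors listed in the YOLOv3 paper. The constant and function names are made up for this example, and sorting by area is just one simple way to reproduce the "biggest anchors to the coarsest scale" assignment.

```python
INPUT_SIZE = 416
STRIDES = (32, 16, 8)      # downsampling factor of each detection scale
BOXES_PER_CELL = 3

def grid_sizes(input_size, strides):
    """Side length of the feature map at each scale."""
    return [input_size // s for s in strides]

def total_boxes(input_size, strides, boxes_per_cell):
    """Every cell of every scale predicts boxes_per_cell boxes."""
    return sum(boxes_per_cell * g * g for g in grid_sizes(input_size, strides))

# (width, height) in pixels: the 9 COCO anchors listed in the YOLOv3 paper.
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

def assign_anchors(anchors, strides, per_scale):
    """Largest anchors go to the largest stride (coarsest grid), and so on."""
    by_area = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
    return {s: by_area[i * per_scale:(i + 1) * per_scale]
            for i, s in enumerate(strides)}

print(grid_sizes(INPUT_SIZE, STRIDES))                    # [13, 26, 52]
print(total_boxes(INPUT_SIZE, STRIDES, BOXES_PER_CELL))   # 10647

assignment = assign_anchors(ANCHORS, STRIDES, BOXES_PER_CELL)
for stride, anchors in assignment.items():
    print(f"stride {stride:>2}: {anchors}")
```

The 52×52 grid alone accounts for 8,112 of the 10,647 boxes, which is one reason the finest scale dominates the post-processing cost.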

Coming to the performance comparison with other state-of-the-art object detectors, YOLOv3 fares quite well against SSD (3x faster) and is a little behind RetinaNet in accuracy. Where YOLOv3 falls short is in aligning boxes precisely with the object: at an IoU threshold of 0.5 it performs well, but performance drops significantly as the threshold is raised. Previous YOLO versions struggled with small objects, but the trend is reversed in YOLOv3, where smaller objects are detected better than larger ones. Something to ponder…
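To see why a stricter IoU threshold punishes slightly misaligned boxes, here is a minimal IoU sketch. The boxes are invented, the function name `iou` is our own, and boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates in pixels.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0, 0, 100, 100)
prediction = (20, 0, 120, 100)   # right size, but shifted 20 px to the right

score = iou(ground_truth, prediction)
print(round(score, 3))           # 0.667

for threshold in (0.5, 0.75):
    verdict = "match" if score >= threshold else "miss"
    print(f"IoU threshold {threshold}: {verdict}")
```

A box that is off by a fifth of the object's width still counts as a correct detection at 0.5 but is rejected at 0.75. This is the kind of localization error that costs a detector accuracy at higher thresholds.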

So, here we are at the end of this article, completing our journey through the YOLOv3 architecture, which is more accurate than YOLOv2 and faster than SSD! Quite a few good improvements were made over the previous version of YOLO: a bigger architecture, better feature maps, three prediction scales (which increased the number of bounding boxes per image), and so on. However, there are a few more things the authors tried that did not work so well for YOLOv3. One is Focal Loss, which decreased the mAP by about 2 points; I will leave the rest as an assignment: read the YOLOv3 paper and understand them.

In our next article, we shall move on to the next version of YOLO, YOLOv4: Optimal Speed and Accuracy of Object Detection, which came out in April 2020. It is 10% more accurate than YOLOv3 and 12% faster, with an mAP of 44% at 65 FPS. Until our next post, try implementing YOLOv3 and understanding its shortcomings. That will start putting ideas in your head on how to improve the architecture, which may lead to a new state-of-the-art object detector! Put down your thoughts on this article in the comments below, and be sure to leave a like and share it with your friends.

### Author

Pranav Raikote

### References

- YOLOv3 Paper: https://arxiv.org/pdf/1804.02767.pdf
- YOLOv2 Paper: https://arxiv.org/pdf/1612.08242.pdf
- ResNet Paper: https://arxiv.org/abs/1512.03385
- DenseNet Paper: https://arxiv.org/abs/1608.06993
- Article on Skip-connections: https://theaisummer.com/skip-connections/
- TensorFlow Implementations: Implementation-1, Implementation-2
- PyTorch Implementation: https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch
- OpenCV Python & C++ Implementation: https://learnopencv.com/deep-learning-based-object-detection-using-yolov3-with-opencv-python-c/