Welcome back to another interesting read on the latest advanced object detector architectures: YOLOv4. YOLOv4 is one of the strongest state-of-the-art object detectors in the industry right now. Without wasting much time, let’s get straight into YOLOv4 and understand why and how it became the new state of the art, with an AP of 43.5% (65.7% AP50) on MS COCO at roughly 65 FPS on a Tesla V100, which is comfortably real-time with good accuracy.
What changed from YOLOv3? Some cool and interesting concepts are packaged into YOLOv4, which make it a well-oiled architecture balancing speed and accuracy! Since a lot of new things are coming up, we shall first lay out the common outline of an object detector and then extend it for our understanding of YOLOv4. Any object detector has a few structural parts, as given below:
- Input: Images and image pyramids
- Backbone: Major CNN networks
- Neck: Neural network blocks like Spatial Pyramid Pooling (SPP), Feature Pyramid Network (FPN), Fully-connected FPN
- Heads: Region Proposal Network (RPN), SSD, YOLO, RetinaNet (Dense Predictions which are the single-stage detectors) and Faster R-CNN, Mask R-CNN, Fast R-CNN (Sparse Predictions which are the two-stage detectors)
Coming to the key concepts used in building the architecture, we have two big virtual bags of concepts: the Bag of Freebies, which increases accuracy by changing only the training strategy or training cost (Data Augmentation, cost functions, etc.), and the Bag of Specials, which holds plugin modules and post-processing methods that add a small amount of inference cost in exchange for a significant gain in accuracy (receptive field enlargement, skip-connections & FPN, post-processing, etc.).
Coming to the bag of freebies, we will first talk about Data Augmentation, which makes the model more robust and helps it generalize better. For photometric distortion we have changes in brightness, contrast, hue, and saturation, and for geometric distortion we have random scaling, cropping, flipping, rotation, skewing, and shearing. Researchers even tried object occlusion as a method of augmentation, which technically means blanking out a randomly chosen box area of the image with zero pixel values. This is quite similar to Dropout regularization, only applied to the images directly! Then there is Mixup, a type of augmentation wherein a new image is formed via weighted linear interpolation of two existing images, and also CutMix, where patches are cut from one training image and pasted onto another, with the ground truth labels mixed accordingly. CutMix improves the model’s robustness against input corruptions.
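To make Mixup and CutMix concrete, here is a minimal sketch in plain Python. It assumes tiny grayscale images stored as nested lists and one-hot classification labels; the function names `mixup` and `cutmix` and the `box` layout are our own choices for illustration, not taken from any library or from the YOLOv4 code.

```python
def mixup(img_a, img_b, label_a, label_b, lam):
    """Blend two equally sized images and their labels with weight lam."""
    mixed_img = [
        [lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_img, mixed_label


def cutmix(img_a, img_b, label_a, label_b, box):
    """Paste the patch box=(top, left, height, width) of img_b onto img_a."""
    top, left, height, width = box
    rows, cols = len(img_a), len(img_a[0])
    bottom, right = min(top + height, rows), min(left + width, cols)

    out = [row[:] for row in img_a]  # copy so img_a stays untouched
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = img_b[r][c]

    # Labels are mixed in proportion to the area that was pasted.
    lam = 1.0 - ((bottom - top) * (right - left)) / (rows * cols)
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return out, mixed_label
```

In practice the mixing weight and the patch location are drawn at random for every training sample, and for detection the bounding boxes have to be carried over together with the patch.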
The bag of freebies also contains methods for tackling the class imbalance problem (focal loss), label smoothing, and the objective function of the bounding box regressor itself. Traditional detectors use a Mean Squared Error (MSE) loss to regress the center point coordinates and the width and height of the box, or the corresponding offsets. This treats every coordinate as an independent variable and ignores the integrity of the box as a whole. To fix that, researchers proposed an IoU loss, which considers the overlap between the predicted and ground truth box areas. IoU itself was improved too: GIoU loss additionally takes the shape and orientation of the object into account by looking at the smallest box that encloses both the predicted and the ground truth box.
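A minimal sketch of IoU and GIoU for two axis-aligned boxes, assuming the `(x1, y1, x2, y2)` corner format (an assumption on our side; YOLO-style code often stores center, width, and height instead). The corresponding losses are simply `1 - iou(a, b)` and `1 - giou(a, b)`.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_and_union(a, b):
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)
    return inter, box_area(a) + box_area(b) - inter


def iou(a, b):
    inter, union = intersection_and_union(a, b)
    return inter / union if union > 0 else 0.0


def giou(a, b):
    """IoU minus the share of the enclosing box not covered by the union."""
    inter, union = intersection_and_union(a, b)
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    hull_area = box_area(hull)
    if union <= 0 or hull_area <= 0:
        return 0.0
    return inter / union - (hull_area - union) / hull_area
```

Note how two boxes that do not overlap at all still get a useful signal from GIoU: their IoU is 0 no matter how far apart they are, while GIoU becomes more negative as the gap between them grows.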
Moving on to the Bag of Specials, the modules here focus on improving the overall accuracy of the object detector with a very small tradeoff in inference cost (the time taken to predict on a new sample). SPP-block modules can be used to enlarge the receptive field, and attention modules such as Squeeze-and-Excitation (SE) improve the top-1 accuracy of a ResNet-50 by about 1% while increasing the computational cost by only 2%. Skip connections are integrated to put together a rich feature map carrying both low-level and high-level semantic information. One significant concept to take note of here is the BiFPN, which uses multi-input weighted residual connections to perform scale-wise re-weighting and then adds up the feature maps of the different scales. This is an improvement over lightweight multi-scale fusion schemes like FPN.
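The scale-wise re-weighting in BiFPN can be sketched as a normalized weighted sum. The version below follows the "fast normalized fusion" idea from the EfficientDet paper: each input map gets a learnable non-negative weight, and the weights are normalized before the maps are added. It assumes the maps were already resized to a common resolution, and the function name `weighted_fusion` is our own.

```python
def weighted_fusion(feature_maps, weights, eps=1e-4):
    """Fuse same-shaped 2-D feature maps using normalized, non-negative weights."""
    w = [max(0.0, x) for x in weights]  # ReLU keeps every weight >= 0
    total = sum(w) + eps                # eps avoids division by zero
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(wi * fm[r][c] for wi, fm in zip(w, feature_maps)) / total
            for c in range(cols)
        ]
        for r in range(rows)
    ]
```

With equal weights this reduces to a plain average of the input maps; during training the network learns how much each scale should contribute.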
The next important thing here is the activation function. It’s fundamental in deep learning to pick an activation function that lets gradients flow well, so that training converges to a good solution in a shorter period. ReLU was excellent, and work continued on top of it to get extended versions in the form of LReLU, PReLU, SELU, Swish, Mish, etc. Mish is particularly important to YOLOv4. It is a novel self-regularized, non-monotonic activation function defined as f(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). Mish works well because it is unbounded above, which avoids saturation, and bounded below, which gives a strong regularization effect. Then comes the important post-processing step in an object detector: filtering the bounding boxes. The most popular technique is Non-Max Suppression (NMS), which keeps the highest-scoring box and discards the other boxes that overlap it too much. Now that we have some understanding of the specific techniques and modules under the Bag of Freebies and Bag of Specials, we are ready to put them together and see how the authors made YOLOv4 work crazily well!
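Both ideas from this paragraph fit into a few lines of plain Python. The sketch below implements Mish exactly as defined above, followed by a simple greedy NMS. The box format `(x1, y1, x2, y2)` and the default threshold of 0.5 are assumptions for illustration only.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))


def iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return the indices of the boxes that survive."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For large positive inputs Mish behaves almost like the identity, and for large negative inputs it approaches zero from below, which is the "unbounded above, bounded below" behavior described above.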
The main architecture of YOLOv4 consists of a CSPDarknet53 backbone, an SPP block, a PANet path-aggregation neck, and a YOLOv3 head. We observed that in TinyYOLO the backbone had 9 convolutional layers, which made it faster but less accurate. In YOLOv3 the backbone was Darknet53 with 53 convolutional layers, which was much more accurate but slower. CSPDarknet53, i.e. Cross Stage Partial Darknet53, is the novel backbone used in YOLOv4. It applies the CSPNet idea, which itself grew out of the DenseNet architecture, to Darknet53. In the CSPNet architecture, the feature map of a stage is split into two paths: one path goes through a block of dense convolutions, while the other path skips that block and is concatenated with its output at the end. The below images illustrate the CSP architecture in comparison with the DenseNet.
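The split-and-concatenate idea behind CSP can be sketched without any deep learning framework. In the toy example below a feature map is just a list of channels, and `toy_dense_block` is a made-up stand-in for the real dense convolution block (it averages channels instead of convolving them). The transition layers of the real CSPNet are left out.

```python
def toy_dense_block(channels, num_layers=2):
    """Stand-in for a dense block: every layer appends one new channel
    computed from all channels seen so far (here simply their mean)."""
    feats = [ch[:] for ch in channels]
    for _ in range(num_layers):
        new_channel = [sum(vals) / len(feats) for vals in zip(*feats)]
        feats.append(new_channel)
    return feats


def csp_block(channels, dense_block=toy_dense_block):
    """Split the channels in two, process one half, concatenate both."""
    half = len(channels) // 2
    shortcut, main = channels[:half], channels[half:]
    processed = dense_block(main)  # only this half pays the compute cost
    return shortcut + processed    # channel-wise concatenation
```

Because only part of the channels goes through the expensive block, the computation drops, while the untouched half still carries its information straight to the next stage.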
CSP-type architectures are quite lightweight and can be trained on a single GPU (Nvidia 1080Ti or 2080Ti). This is one of the striking points of YOLOv4: it is easily trainable on a single GPU! Next comes the neck, which combines feature maps from various stages and makes the features ready for the detection stage. In YOLOv4, a Spatial Pyramid Pooling module is appended to the backbone. Head over to our Blog Post 4 for an in-depth understanding of how the SPP Layer works. In its original form, the SPP Layer enables the use of variable input image sizes by pooling over multilevel spatial bins, which generates a fixed n-dimensional vector regardless of the input size. It makes the network robust to various image sizes and scales. The YOLO-style SPP block is a variant of this idea: it concatenates the outputs of stride-1 max-pooling layers with different kernel sizes, so the spatial resolution of the feature map is preserved.
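The "fixed-size vector regardless of input size" property is easy to check with a small sketch. The function below pools a single-channel feature map over 1x1, 2x2, and 4x4 grids of spatial bins and concatenates the results. The pyramid levels `(1, 2, 4)` and the function name are our own choices for illustration.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D map over n x n grids of bins and concatenate the results."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * rows) // n                 # floor of the bin start
            r1 = ((i + 1) * rows + n - 1) // n   # ceiling of the bin end
            for j in range(n):
                c0 = (j * cols) // n
                c1 = ((j + 1) * cols + n - 1) // n
                pooled.append(
                    max(feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return pooled
```

A 4x4 map and a 6x9 map both come out as a vector of 1 + 4 + 16 = 21 values, which is what lets a fully connected layer sit behind inputs of varying size.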
There is one more technique included in the neck, i.e. the Path Aggregation Network (PANet). What is this path aggregation? Earlier CNNs were strictly linear and needed none of the concatenation and addition operations which we see nowadays in ResNets, DenseNets, etc. Today we have blocks, skip connections, and aggregations of data flowing through the layers. These new-age techniques are called parameter aggregation methods. We are not new to such techniques in object detectors: the Feature Pyramid Network (FPN), Spatial Attention Module (SAM), and Spatial Pyramid Pooling (SPP) are all examples. YOLOv4 uses a modified PANet and SPP in its architecture. Back to the PANet, it simply adds a bottom-up pathway on top of an FPN (FPN has only a top-down pathway). The below image will throw light on the difference between FPN & PANet.
The modification made to PANet is that the feature maps flowing through the connections are concatenated instead of being added element-wise. At this stage, the architecture looks like this: CSPDarknet53 + SPP Layer + PANet. The final stage, the head, is nothing but YOLOv3’s head without any modifications at all. Link to Blog Post on YOLOv3 here for reference. So, finally, we have covered the architecture of YOLOv4, but we are not done yet. Hang in there for a few more minutes to understand how the authors made this architecture a truly state-of-the-art object detector.
Many experiments were performed to improve the performance of the architecture and, of course, to help in training and post-processing too. Let’s start with the Bag of Freebies, the collection of concepts/techniques which improve accuracy without adding time at inference. It consists mostly of Data Augmentation and Regularization techniques. For the backbone, CutMix, Mosaic Augmentation, DropBlock regularization, and Class Label Smoothing are used. CutMix is a method where patches of images are randomly cropped and pasted on top of other images. It was already used in Image Classification networks and is well known. The newly introduced Mosaic Augmentation tiles 4 images together, which pushes the model to learn objects at a smaller scale and outside of their usual surroundings. DropBlock is similar to Dropout regularization, only it acts on a whole contiguous patch of a feature map in the Convolutional Layers instead of on individual units. In other words, units in a contiguous region of a feature map are dropped together. This gives a stronger regularization effect than Dropout in Convolutional Neural Networks.
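Here is a simplified DropBlock sketch for a single 2-D feature map. It is a teaching version under our own assumptions: every position becomes the top-left corner of a dropped square with probability `seed_prob`, whereas the DropBlock paper derives that probability from a target keep rate and centers the blocks. The surviving activations are rescaled so the overall magnitude stays roughly the same.

```python
import random


def drop_block(feature_map, block_size, seed_prob, rng):
    """Zero out block_size x block_size squares of a 2-D feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    mask = [[1] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if rng.random() < seed_prob:
                # Drop the whole square anchored at (r, c), clipped to the map.
                for rr in range(r, min(r + block_size, rows)):
                    for cc in range(c, min(c + block_size, cols)):
                        mask[rr][cc] = 0
    kept = sum(sum(row) for row in mask)
    scale = (rows * cols) / kept if kept else 0.0  # renormalize survivors
    return [
        [feature_map[r][c] * mask[r][c] * scale for c in range(cols)]
        for r in range(rows)
    ]


# Usage: dropped = drop_block(fmap, block_size=3, seed_prob=0.1, rng=random.Random(0))
```

Dropping whole regions matters in convolutional layers because neighboring activations are strongly correlated: removing isolated units, as Dropout does, still lets the information leak through their neighbors.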
Class label smoothing also acts as a regularizer, by adding a controllable uniform distribution to the one-hot encoding of the target labels when the loss function is cross-entropy. The model can get quite overconfident in its predictions, so we introduce a factor α which, when set to 0, gives a perfect one-hot encoding and, when set to 1, gives a perfect uniform distribution. The reason why it works is that one-hot encoded labels always encourage the largest possible gap between logits, which makes the model too confident and may also lead to overfitting. The smoothed labels give smaller gaps between the logit predictions, which leads to better model calibration and prevents overfitting. The formula for class label smoothing is y = (1 – α) * y_hot + α/K, where K is the number of classes. The below image will illustrate the CutMix, DropBlock, and Mosaic Augmentations.
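The label smoothing formula translates directly into code. A minimal sketch (the function name `smooth_labels` is our own):

```python
def smooth_labels(one_hot, alpha):
    """y = (1 - alpha) * y_hot + alpha / K, with K the number of classes."""
    k = len(one_hot)
    return [(1.0 - alpha) * y + alpha / k for y in one_hot]
```

For example, with 4 classes and α = 0.1 the target for the true class drops from 1 to 0.925, and every other class gets 0.025 instead of 0, so the targets still sum to 1.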
The techniques taken from the Bag of Freebies for the detector are CIoU-loss, Self-Adversarial Training (SAT), Cross mini-Batch Normalization (CmBN), DropBlock, Mosaic Augmentation, and a set of tuned hyperparameters. The CIoU loss extends the IoU-based losses discussed earlier: besides the overlap between the ground truth and predicted boxes, it also penalizes the distance between their center points and the difference in their aspect ratios. Self-Adversarial Training is a new data augmentation technique that operates in 2 forward-backward stages. In the first stage, the network alters the input image instead of its own weights, which is a type of adversarial attack on itself that creates the deception that there is no desired object in the image. In the second stage, the network is trained to detect the object on this modified image in the normal way. CmBN is a modified version of Cross-Iteration Batch Normalization (CBN) that collects statistics only between the mini-batches within a single batch, so the weight update and scale-shift operations are performed at the end of 4 mini-batches instead of at the end of each mini-batch. A cosine annealing schedule is used to update the learning rate, which aids in getting out of local minima more easily. And to find the optimal hyperparameters, genetic algorithms are used. N random sets of hyperparameters are initialized and trained, and the best K models are selected. The next generation of N candidates is then initialized from the hyperparameters of those K best models. This continues until the final iteration is reached.
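Two of these freebies are compact enough to sketch: the CIoU loss and a cosine learning rate schedule. The CIoU code follows the commonly cited definition (1 − IoU, plus the normalized squared center distance, plus an aspect-ratio term) and assumes `(x1, y1, x2, y2)` boxes. The schedule is the plain cosine annealing formula without warm-up or restarts. All names and default values here are our own assumptions, not the Darknet implementation.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    center_dist = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 + (
        (py1 + py2) / 2 - (ty1 + ty2) / 2
    ) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    diag = (max(px2, tx2) - min(px1, tx1)) ** 2 + (max(py2, ty2) - min(py1, ty1)) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((tx2 - tx1) / (ty2 - ty1 + eps))
        - math.atan((px2 - px1) / (py2 - py1 + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - iou + center_dist / (diag + eps) + alpha * v


def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max (step 0) down to lr_min (last step)."""
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)
```

The loss is close to 0 for a perfect prediction and grows when the boxes overlap less, when their centers drift apart, or when their shapes disagree.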
There are a few more techniques applied from the Bag of Specials that modify key components of the network. For the backbone, these are Mish activation, Cross-stage partial connections, and Multi-input weighted residual connections (MiWRC). For the detector, these are techniques like the SPP-block, the PANet block, and a modified SAM block. The original Spatial Attention Module applies MaxPooling and AvgPooling to a feature map, concatenates the two results, and passes them through a convolution and a sigmoid to get a spatial attention mask, which is then multiplied with the feature map. In YOLOv4, it is modified in such a way that it skips the MaxPooling & AvgPooling layers: the feature map goes straight through a convolution and a sigmoid, turning the spatial-wise attention into point-wise attention.
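The modified SAM can be sketched as "convolution, sigmoid, element-wise multiply". The toy version below assumes a 1x1 convolution (the kernel size is our simplification, chosen to keep the code short) over a feature map stored as `features[channel][row][col]`. The weights and biases would be learned in a real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pointwise_attention(features, weights, biases):
    """Modified-SAM-style gating: out = features * sigmoid(conv1x1(features)).

    weights[out_c][in_c] and biases[out_c] describe the 1x1 convolution.
    """
    channels = len(features)
    rows, cols = len(features[0]), len(features[0][0])
    out = []
    for oc in range(channels):
        plane = []
        for r in range(rows):
            row = []
            for c in range(cols):
                logit = biases[oc] + sum(
                    weights[oc][ic] * features[ic][r][c] for ic in range(channels)
                )
                # Every single activation gets its own gate in (0, 1).
                row.append(features[oc][r][c] * sigmoid(logit))
            plane.append(row)
        out.append(plane)
    return out
```

Compared with the original SAM, there is no pooling step that collapses the channels into one shared spatial mask, so every value in the feature map is re-weighted individually.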
All these experiments were evaluated on the ImageNet validation dataset for Image Classification and on the MS COCO test-dev dataset for detector accuracy. For the ImageNet Image Classification experiments, training ran for around 8,000,000 steps with a polynomial decay learning rate, along with the batch size and a few more hyperparameters given in the paper. For the MS COCO detection experiments, training ran for around 500,500 steps with an initial learning rate of 0.01. All architectures were trained on a single GPU with a batch size of 64. I recommend looking into the paper for more intricate details about the training and the ablation studies performed. Finally, the below image shows the performance of YOLOv4 on the MS COCO dataset, where it achieves state-of-the-art results compared to other detectors.
Let me pause here and applaud you for staying with us right till the end in understanding this state-of-the-art object detector. The architecture is quite complex, an amalgamation of various concepts. My suggestion for studying the research paper would be to go slow and study the various techniques one by one, and it will be a nice experience.
And now, we are at the end of this article and of the series, Advanced Object Detection. Hope you had a good time learning about the most famous and advanced state-of-the-art Object Detectors. We shall come back soon with our next post on ____. It’s a surprise! Till then, try implementing YOLOv4 and see if you can improve and innovate. It should be trainable on a single GPU (NVIDIA 1080Ti/2080Ti) with 8-16 GB of GPU memory (VRAM). Turn to Google Colaboratory or Kaggle for accessing GPUs and training your object detectors. Put down your comments on this post, and share it if you liked it and think it has the potential to help someone interested in this space.
- YOLOv4 paper: https://arxiv.org/pdf/2004.10934.pdf
- YOLOv3 paper: https://arxiv.org/pdf/1804.02767.pdf
- CSPNet paper: https://arxiv.org/pdf/1911.11929v1.
- PANet paper: https://arxiv.org/pdf/1803.01534
- FPN paper: https://arxiv.org/abs/1612.03144
- SPP paper: https://arxiv.org/abs/1406.4729
- EfficientDet paper: https://arxiv.org/pdf/1911.09070
- DropBlock paper: https://arxiv.org/abs/1810.12890
- Spatial Attention Module paper: https://arxiv.org/abs/1807.06521v2
- PyTorch implementation: Implementation 1, Implementation 2
- TensorFlow implementation: Implementation 1, Implementation 2, Implementation 3