Welcome back to the Advanced Object Detection blog post series! In our previous posts, we developed a thorough understanding of YOLOv1, YOLOv2, YOLO9000 & the SSD Multibox detector, all state-of-the-art detectors that brilliantly outperform one another. In this post, we shall talk about another one of them: RetinaNet. RetinaNet differs from the YOLOs & SSD in a few aspects, the main one being the loss function. RetinaNet employs a Focal Loss function that down-weights easy negatives and focuses training on hard examples, directly addressing the class imbalance problem observed when training an object detector. The architecture uses an FPN (Feature Pyramid Network) with ResNet as the backbone CNN, outperforms Faster R-CNN, and won the Best Student Paper Award at ICCV (International Conference on Computer Vision) 2017.

Let’s understand one of the main problems in single-stage and two-stage object detectors alike: the class imbalance between background regions and the foreground objects that are of interest for detection. Most proposals land on the background and are treated as negative instances during training, so negatives are sampled at a fixed ratio to maintain balanced instances while training. This is called Hard Negative Mining (a topic touched upon in our previous blogs). In two-stage detectors, the problem is solved to some extent using sampling heuristics such as the Region Proposal Network (RPN), which brings the number of candidates down to a minimum. But in single-stage detectors, the network must process a much larger set of proposals while remaining efficient and accurate. The RPN also has issues of its own, which we saw in our blogs on the R-CNN family of object detectors: the feature map fails to capture the details of small objects, either because the network is too deep or because sub-sampling loses too much information as activations propagate through the layers. Now it’s time to dive into RetinaNet, understand its architecture, and see how it solves these problems to deliver a state-of-the-art object detector.

RetinaNet shares similarities with previous object detector architectures: the use of anchors, multi-scale feature pyramids (SSD), and the Feature Pyramid Network. What makes the difference is a novel loss function, the Focal Loss, introduced to solve the class-imbalance problem that is especially pronounced in single-stage detectors. We shall build up to the Focal Loss starting from the most basic classification loss function: cross-entropy loss. Given below is the cross-entropy loss, commonly known as log loss, one of the most fundamental loss functions in deep learning.

CE Loss(p, y) = -y log(p) - (1-y) log(1-p)

In the object detection context, y ∈ {0, 1} is the binary ground-truth label indicating whether a bounding box contains an object, and p ∈ [0, 1] is the predicted probability that it does. Now, let’s represent this in a form that will be easier to work with later. We define p_{t} = p if y = 1, and p_{t} = 1-p otherwise, so the loss function becomes CE Loss(p, y) = CE Loss(p_{t}) = -log(p_{t}). Even for examples where the prediction is very close to the correct label, i.e. examples classified correctly with high confidence, the loss is still non-trivial, and summed over an enormous number of such easy examples it can dominate the total loss. The most common method to address this is to add a weighting factor 𝛂 ∈ [0,1] for the positive class and 1-𝛂 for the negative class. This is called the balanced cross-entropy loss, given by,

Balanced CE Loss(p_{t}) = -𝛂_{t} log(p_{t})
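As a quick sanity check, the definitions above can be sketched in a few lines of Python (a minimal scalar illustration, not the vectorized form a real training loop would use):

```python
import math

def ce_loss(p, y):
    # Binary cross-entropy: -y*log(p) - (1 - y)*log(1 - p)
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def p_t(p, y):
    # p_t = p if y == 1, else 1 - p
    return p if y == 1 else 1 - p

def balanced_ce(p, y, alpha=0.25):
    # Weight the positive class by alpha and the negative class by (1 - alpha)
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * math.log(p_t(p, y))

# Even a confidently correct positive (p = 0.9) retains a non-trivial loss;
# summed over tens of thousands of easy negatives, this dominates training.
easy_loss = ce_loss(0.9, 1)   # -log(0.9) ≈ 0.105
hard_loss = ce_loss(0.1, 1)   # -log(0.1) ≈ 2.303
```

Note that `ce_loss(p, y)` always equals `-log(p_t(p, y))`, which is exactly the compact form used above.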

The alpha weighting balances the positive/negative examples but does not differentiate between easy and hard examples. In comes the Focal Loss. The focal loss adds a modulating factor that down-weights the easy examples so that their contribution to the loss becomes minimal. This weighting factor in mathematical form is (1-p_{t})^{𝛄}. Now the CE loss function becomes,

FocalLoss(p_{t}) = -(1-p_{t})^{𝛄} log(p_{t})

Gamma (𝛄) values greater than 0 reduce the relative loss for well-classified examples, so more focus is given to the hard, misclassified examples. Given below is the graph of the cross-entropy loss modulated by various values of gamma.

When gamma = 0, the Focal Loss is equal to the CE loss. What we can observe from the above graph is that when p_{t} is very small the curve is nearly unchanged, but it flattens as p_{t} nears 1, which is exactly the motive: keep the focus on misclassified examples (where p_{t} → 0) while down-weighting the well-classified ones. Setting gamma to 2 yielded the best results. The authors added an alpha term to improve it even further and make it consistent with the use of the sigmoid activation function, which resulted in greater numerical stability.

𝛂-Balanced FocalLoss (p_{t}) = -𝛂_{t}(1-p_{t})^{𝛄} log(p_{t})
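Putting the pieces together, here is a minimal scalar sketch of the 𝛂-balanced Focal Loss, using the 𝛂 = 0.25 and 𝛄 = 2 values reported in the paper:

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    # FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
    # gamma = 0 with alpha_t = 1 recovers plain cross-entropy
    return -alpha_t * ((1 - p_t) ** gamma) * math.log(p_t)

# Easy example (p_t = 0.9): the modulating factor (1 - 0.9)^2 = 0.01
# shrinks the loss about 100x relative to cross-entropy.
easy = focal_loss(0.9)

# Hard example (p_t = 0.1): the factor (1 - 0.1)^2 = 0.81 leaves the
# loss almost unchanged, keeping the focus on misclassified examples.
hard = focal_loss(0.1)
```

The asymmetry between the two cases is the entire point: easy examples nearly vanish from the gradient, while hard ones still drive learning.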

The RetinaNet architecture used 𝛂 = 0.25 & 𝛄 = 2 for training. Initializing training in a highly unbalanced setup can lead to instability, as the loss from the frequent (background) class dominates the total loss. To counter this, the concept of a prior was introduced for the value of p the model estimates for the rare (foreground) class at the start of training; this prior, denoted 𝝿, was set very low (0.01). This was found to improve stability in the initial stages of training. Let us now move on to the actual architecture of RetinaNet.
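In practice, this prior is implemented by initializing the bias of the final classification layer. A minimal sketch of the arithmetic (the bias formula follows the paper; the helper names are my own):

```python
import math

PI = 0.01  # prior probability assigned to the rare (foreground) class

# Setting the final classification layer's bias to b = -log((1 - pi)/pi)
# makes the initial sigmoid output for every anchor approximately pi, so
# the vast number of background anchors yields small, stable losses early on.
bias = -math.log((1 - PI) / PI)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

initial_p = sigmoid(bias)   # ≈ 0.01 for every anchor at initialization
```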

RetinaNet is a unified network consisting of a backbone CNN and two task-specific subnetworks: a classification subnet and a box regression subnet. The backbone is an ordinary CNN that computes a convolutional feature map for each image, which is then passed to the subnets. The authors used a Feature Pyramid Network (FPN) to tackle objects of various scales and shapes. The FPN extracts both low-level and high-level features in a pyramidal fashion with lateral connections, and outputs a feature-rich multi-scale pyramid from a single image. This FPN is built on top of a ResNet architecture. Note that any robust CNN can be used here with minor modifications.

Let’s understand the FPN in a more detailed way, using the image below as a reference. The bottom-up pathway is a normal feedforward configuration. The magic happens in the top-down pathway: semantically strong feature maps are added back into the larger, higher-resolution levels of the pyramid. The higher-level features are upsampled, the corresponding bottom-up feature map is passed through a 1×1 Conv layer to match the channel dimension, and the two maps are merged via element-wise addition. This is the general working of the FPN. One point to note is that these lateral connections happen in the latter stages of the network.
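The merge step can be illustrated with a toy 1-D example. To be clear about the assumptions: real FPN levels are 4-D tensors and the lateral 1×1 conv is a learned projection; here both are reduced to plain Python lists purely for illustration.

```python
def upsample_2x(coarse):
    # Nearest-neighbour 2x upsampling of the coarser, semantically stronger map
    return [v for v in coarse for _ in range(2)]

def lateral_1x1(bottom_up, w=1.0):
    # Stand-in for the 1x1 conv that matches channel dimensions (toy version:
    # a single per-element weight instead of a learned projection)
    return [w * v for v in bottom_up]

def merge(coarse, bottom_up):
    up = upsample_2x(coarse)
    lat = lateral_1x1(bottom_up)
    # Element-wise addition fuses high-level semantics with fine spatial detail
    return [a + b for a, b in zip(up, lat)]

p5 = [1.0, 2.0]              # coarse top-down features
c4 = [0.5, 0.5, 0.5, 0.5]    # bottom-up features at the finer level
p4 = merge(p5, c4)           # -> [1.5, 1.5, 2.5, 2.5]
```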

Coming back to the architecture of RetinaNet, the authors used the FPN’s pyramid with levels P_{3} to P_{7}. The ResNet backbone has 5 Conv blocks; P_{3} to P_{5} correspond to the residual stages C_{3} to C_{5}, P_{6} is obtained by a 3×3 Conv (stride 2) over C_{5}, and P_{7} by applying ReLU followed by a 3×3 Conv (stride 2) on P_{6}. The pyramid level P_{2} isn’t used, for computational reasons. Predictions are generated from all the pyramid levels, each with a constant 256-dimensional channel depth (since they share the classification and bounding box subnetworks). Given below is the RetinaNet architecture as described in the paper.

Coming to the anchors used in RetinaNet, three aspect ratios were selected, {1:2, 1:1, 2:1}, at each level. The anchor areas range from 32^{2} to 512^{2} over pyramid levels P_{3} to P_{7}. At each level, anchors of scales {2^{0}, 2^{⅓}, 2^{⅔}} are added, giving 9 anchors per level and covering a scale range of 32-813 pixels. Each anchor is associated with a K-length one-hot vector (for K classes) used for classification and a 4-length vector of bounding box targets. Anchors are assigned to ground-truth boxes at an IoU (Intersection over Union) threshold of 0.5. If an anchor is assigned an object, the kth index in its vector is labeled 1 and the rest are zeros. The bounding box targets are calculated as the offset between each anchor and its assigned object box.
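The 9 anchors at a level follow directly from these numbers. Here is a small sketch, assuming the base anchor area at level P_l is (2^{l+2})^2, which matches 32^2 at P_3 and 512^2 at P_7:

```python
def anchors_for_level(level):
    # 3 octave scales x 3 aspect ratios = 9 anchors per pyramid level
    base = 2 ** (level + 2)                       # 32 for P3, ..., 512 for P7
    scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]
    ratios = [0.5, 1.0, 2.0]                      # 1:2, 1:1, 2:1 (height/width)
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:
            # Choose w, h so that w*h == area and h/w == r
            w = (area / r) ** 0.5
            h = w * r
            anchors.append((w, h))
    return anchors

p3_anchors = anchors_for_level(3)   # 9 (width, height) pairs
```

Note that the largest scale, 512 · 2^{2/3} ≈ 813, is where the 32-813 pixel range quoted above comes from.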

The classification subnetwork is a Fully Convolutional Network (FCN) attached to each of the pyramid levels, with parameters shared across levels. This sub-network applies four 3×3 Conv layers, each with C filters (C = 256) followed by ReLU activations. After this, it applies a final 3×3 Conv layer with K·A filters (K = no. of classes and A = no. of anchors at each level, which is 9). Sigmoid activations follow and the binary predictions are output. These computed features are not shared with the bounding box subnetwork, contrary to the original RPN used in Faster R-CNN. The bounding box regression subnetwork is also an FCN attached to each pyramid level, regressing the offsets of the anchor boxes from the ground-truth boxes. This subnet shares its design with the classification subnet except that it ends in 4·A linear outputs. The two subnetworks share a common structure but use separate parameters. At inference, only the top 1000 scoring predictions per FPN level are kept to speed up the process, and the best predictions from all the levels are merged using non-max suppression.
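To make the head dimensions concrete, here is a small sketch of the two heads' output shapes at one pyramid level (the H × W spatial size is my own illustrative choice):

```python
def head_output_shapes(h, w, num_classes, num_anchors=9):
    # Classification head: K*A sigmoid scores per spatial position
    cls_shape = (h, w, num_classes * num_anchors)
    # Box regression head: 4*A offsets per spatial position
    box_shape = (h, w, 4 * num_anchors)
    return cls_shape, box_shape

# e.g. a 64x64 feature map with K = 80 classes (MS COCO):
cls_shape, box_shape = head_output_shapes(64, 64, num_classes=80)
# cls_shape == (64, 64, 720), box_shape == (64, 64, 36)
```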

RetinaNet was trained using the 𝛂-balanced Focal Loss on the ResNet-101-FPN & ResNet-50-FPN networks (pre-trained on ImageNet). The total focal loss per image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. It was very important to initialize training with the prior (𝝿); otherwise, the loss blows up during the initial phase of training and diverges instead of converging to an optimal solution. The optimizer used was Stochastic Gradient Descent (SGD) with a learning rate of 0.01, reduced at 60k and again at 80k iterations. Only horizontal flipping was used as data-augmentation-based regularization. The overall loss function the network optimizes is the sum of the Focal Loss and the smooth L1 loss (for the bounding box regression subnetwork). RetinaNet outperforms the SSD, R-FCN, and FPN architectures, and at larger scales it further surpasses the performance of all two-stage object detectors. RetinaNet with ResNet-101-FPN and an image size of 600 pixels matches the performance of ResNet-101-FPN based Faster R-CNN. It achieved an mAP of 40.8 on the MS COCO dataset with ResNeXt-101-FPN as the backbone.
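The combined training objective can be sketched as follows. This is a scalar toy version: it assumes the standard smooth L1 definition with β = 1, and real implementations operate on batched tensors rather than Python lists.

```python
import math

def smooth_l1(x, beta=1.0):
    # Huber-style smooth L1 used for the box regression targets:
    # quadratic near zero, linear for large errors
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    return -alpha_t * ((1 - p_t) ** gamma) * math.log(p_t)

def total_loss(p_ts, box_errors, num_assigned):
    # Sum of focal loss over all anchors, normalized by the number of
    # anchors assigned to a ground-truth box, plus the regression loss
    cls = sum(focal_loss(p) for p in p_ts) / max(num_assigned, 1)
    box = sum(smooth_l1(e) for e in box_errors)
    return cls + box
```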

Many experiments and hyperparameter optimizations were made beyond the base benchmark results. I recommend reading the paper, especially the Experiments, Ablation Studies, and Appendix sections, for cooler stuff! One thing I have noticed from studying a handful of research papers is that these sections help build intuition on how to experiment with and improve an architecture's performance; the major comparisons and benchmarks are written there too. Don't miss them at any cost, as there are a few more interesting observations in the paper that I have left for you to explore 🙂

With that, we have now reached the end of our comprehensive tour of RetinaNet! Quite a feeling, isn't it? In our next post, we will come back to the YOLO family of object detectors and the successors to YOLOv2. You heard it right: we are venturing towards YOLOv3 and YOLOv4. Until then, try implementing this network in a neural network framework like TensorFlow or PyTorch (no offense to the others, just my preference). Put down your thoughts about this article in the comments below, and do mention any architecture or method you would like explained; I shall come back with a detailed explanation that is simple and straightforward to understand.

### Author

Pranav Raikote

### References

- RetinaNet Paper: https://arxiv.org/pdf/1708.02002.pdf
- FPN Paper: https://arxiv.org/abs/1612.03144
- ResNet Paper: https://arxiv.org/abs/1512.03385
- Official Keras implementation: https://keras.io/examples/vision/retinanet/#training-the-model
- PyTorch implementation: https://github.com/yhenon/pytorch-retinanet