Welcome back to the Object Detection Series. In our previous articles, we looked at a few limitations of R-CNN and how SPP-net & Fast R-CNN solved them to a great extent, bringing inference time down to ~2s per test image from R-CNN's ~45-50s. But even after such a speedup, there are still flaws to fix and enhancements to make before deploying it on a real-time 30fps or 60fps video feed. As we know from our previous blog, Fast R-CNN & SPP-net still require multi-stage training and rely on the Selective Search Algorithm for generating regions. This is a huge bottleneck for the entire system because Selective Search takes plenty of time to generate its ~2000 region proposals. This problem was solved in Faster R-CNN, the widely used State-of-the-Art version in the R-CNN family of Object Detectors. We've seen the evolution of architectures in the R-CNN family, where the main improvements were computational efficiency, accuracy, and reduction of test time per image. Let's dive into Faster R-CNN now!
The major bottleneck we've seen in R-CNN, Fast R-CNN, and SPP-net is the Selective Search Algorithm. It takes around 2s per image to generate the proposals, and it runs on the CPU. Even if we have a GPU, time is lost transferring the proposals to it for further processing through the CNN. The Faster R-CNN paper introduces a Region Proposal Network that brings this time down from 2s to about 10ms, while being as accurate as (and sometimes better than) Selective Search. Let's see how Selective Search was replaced in Faster R-CNN.
Faster R-CNN is a unified model with two sub-networks: the Region Proposal Network (RPN), a Convolutional Neural Network that proposes regions, and a Fast R-CNN that extracts features and outputs the Bounding Boxes and Class Labels. Here, the RPN serves as an attention mechanism in the Faster R-CNN pipeline. Let's understand the importance of the RPN and how it replaces the Selective Search Algorithm. Given below is a pictorial representation of the RPN in Faster R-CNN.
Region Proposal Network within the Faster R-CNN architecture. Image Credits – Faster R-CNN paper
In the Faster R-CNN network, there is one backbone CNN, and its output features are used by both the RPN and the Object Detector network, which is the Fast R-CNN. The Region Proposal Network uses a Sliding Window approach, i.e. it slides a window of a specific size over the feature map and generates 'k' Anchor Boxes of different shapes and sizes at each position. By default, k is 9, with 3 scales (128×128, 256×256, and 512×512) crossed with 3 aspect ratios (1:1, 1:2, and 2:1). The images below give us a concept of the Anchor Boxes and the RPN's sliding window in action, followed by a small code sketch of the anchor generation.
Anchor Boxes’ Configurations within the Faster R-CNN’s RPN network. Image Credits – TowardsDataScience
Sliding Window generating the k Anchor Boxes – RPN. Image Credits – GeeksForGeeks
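To make the anchor mechanics concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code) that generates the k = 9 anchors centered at one sliding-window position, using the default scales and aspect ratios above:

```python
import numpy as np

def generate_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return k = len(scales) * len(ratios) anchors as (x1, y1, x2, y2),
    all centered at (cx, cy) in input-image coordinates."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the anchor's area ~ s*s while varying its aspect ratio r = w/h.
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# 9 anchors centered at pixel (300, 200)
print(generate_anchors(300, 200).shape)  # (9, 4)
```

Sliding this over every position of the feature map yields the full set of anchors for the image.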
The task of the RPN is to predict the probability of an anchor being background or foreground (containing an object). During training, the input image is accompanied by the ground-truth boxes, and the network learns to improve its region proposals. For a 40×60 feature map with 9 anchors per position, roughly 20k proposals are generated (40×60×9 ≈ 21,600), which is still a large number. The authors included a Softmax layer from which we get confidence scores, rank the proposals, and keep just the top-n anchor proposals. An anchor is considered positive (presence of an object) based on either of two conditions: the anchor has the highest IoU (Intersection over Union, a measure of overlap) with a ground-truth box, or the anchor has an IoU of 0.70 or higher with any ground-truth box. On the other hand, an anchor is negative if its IoU is below 0.30 with every ground-truth box. The remaining anchors are ignored during training. If we sampled all anchors, there would be a bias towards negative samples; to avoid this, 128 positive and 128 negative anchors are randomly sampled per image.
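A small sketch of this labeling rule (again my own illustration; in practice the "highest IoU" condition is computed across all anchors beforehand and passed in here as a flag):

```python
import numpy as np

def iou(box, gt):
    """IoU between one anchor and one ground-truth box, both (x1, y1, x2, y2)."""
    x1, y1 = max(box[0], gt[0]), max(box[1], gt[1])
    x2, y2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_a + area_g - inter)

def label_anchor(anchor, gt_boxes, is_max_overlap):
    """1 = foreground, 0 = background, -1 = ignored during training."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if is_max_overlap or best >= 0.7:
        return 1
    if best < 0.3:
        return 0
    return -1

gt = [np.array([90., 70., 310., 270.])]
a = np.array([100., 80., 300., 260.])
print(iou(a, gt[0]))                             # ~0.82
print(label_anchor(a, gt, is_max_overlap=False)) # 1 (positive)
```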
In addition to the binary softmax classifier, there is a regression layer that outputs offsets refining the anchor's x, y, w, h (x, y being the center of the anchor, w its width, and h its height); a small sketch of this offset parameterization follows the figure below. The regression loss is applied only to anchors labeled positive. All cross-boundary anchors (those crossing the image boundary) are discarded, as they don't contribute much to the optimization. We will revisit the loss functions and the training procedure later in this article. Now, with the RPN explained, the detailed Faster R-CNN pipeline looks as shown below:
The Detailed Faster R-CNN Architecture. Image Credits – TowardsDataScience
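As promised above, the regression outputs are parameterized relative to the anchor rather than as raw coordinates; the paper uses tx = (x − xa)/wa, ty = (y − ya)/ha, tw = log(w/wa), th = log(h/ha). A minimal NumPy sketch of decoding predicted offsets back into a box:

```python
import numpy as np

def decode(anchor, deltas):
    """Apply predicted offsets (tx, ty, tw, th) to an anchor (x1, y1, x2, y2),
    following the parameterization in the Faster R-CNN paper:
    x = tx * wa + xa, y = ty * ha + ya, w = wa * exp(tw), h = ha * exp(th)."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    xa, ya = anchor[0] + wa / 2, anchor[1] + ha / 2
    tx, ty, tw, th = deltas
    x, y = tx * wa + xa, ty * ha + ya
    w, h = wa * np.exp(tw), ha * np.exp(th)
    return np.array([x - w / 2, y - h / 2, x + w / 2, y + h / 2])

print(decode(np.array([100., 80., 300., 260.]), (0.1, -0.05, 0.2, 0.0)))
```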
The different-sized regions proposed by the RPN are fed to the ROI Pooling layer (refer to the blog post on Fast R-CNN for a better understanding of ROI Pooling). Here, the variable-sized feature regions are pooled into a fixed-size representation, which is in turn given to the Softmax Classification layer and the Bounding Box Regressor layer. Apart from the RPN, the remainder of the architecture is a Fast R-CNN serving as the detector network. With the architecture understood, let's now see how we train this network with its multiple loss functions.
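torchvision ships an ROI pooling operator, so we can see the fixed-size output directly. A minimal sketch (the 7×7 output size and 1/16 scale are typical VGG-16 settings assumed here, not taken from this article):

```python
import torch
from torchvision.ops import roi_pool

# A toy feature map: batch of 1, 512 channels, 40x60 spatial size.
features = torch.randn(1, 512, 40, 60)

# Two proposals in (batch_index, x1, y1, x2, y2) format, in *input image*
# coordinates; spatial_scale maps them onto the 16x-downsampled feature map.
rois = torch.tensor([[0, 100., 80., 300., 260.],
                     [0, 20., 40., 200., 220.]])

pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]) -- fixed size per proposal
```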
The RPN is optimized for the multi-task loss function given below, which combines a classification loss with a regression loss. In the loss function, pi is the predicted probability of anchor i being an object, and pi* is the ground-truth label indicating whether anchor i is an object. Lcls is a log loss over two classes: the sample is the target object versus not.
The regression loss uses a smooth L1 function. Here, ti and ti* are the four parameterized coordinates of the predicted box and the ground-truth box, respectively.
The Loss functions of RPN
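Written out (reconstructed here from the paper's formulation, using the same symbols as the figure):

```latex
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{box}} \sum_i p_i^* \, L_{box}(t_i, t_i^*)

L_{box}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} \text{smooth}_{L_1}(t_{i,j} - t_{i,j}^*),
\qquad
\text{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```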
Ncls is a normalization term set to the mini-batch size (256), and Nbox is a normalization term set to the number of anchor locations (~2400). λ is a balancing parameter set to 10 so that Lcls and Lbox are weighted roughly equally. The RPN is trained end-to-end via backpropagation and standard Stochastic Gradient Descent with a learning rate of 0.001. But we also want the RPN and the Detector network to share the convolutional features, which will decrease the inference time. The authors came up with a 4-step training procedure that learns the shared features via alternating optimization.
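A minimal PyTorch sketch of this loss (my own paraphrase, assuming labels of 1/0/-1 for positive/negative/ignored anchors and k = 9 anchors per location):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, labels, box_preds, box_targets, lam=10.0):
    """RPN multi-task loss sketch. cls_logits: (A, 2), labels: (A,) with
    1 = positive, 0 = negative, -1 = ignored. Normalizers follow the paper:
    N_cls = sampled mini-batch size, N_box = number of anchor locations."""
    sampled = labels >= 0
    # Mean log loss over the 256 sampled anchors (the 1/N_cls term).
    cls = F.cross_entropy(cls_logits[sampled], labels[sampled])
    # Smooth L1 summed over positive anchors only (p_i* gates the term).
    pos = labels == 1
    box = F.smooth_l1_loss(box_preds[pos], box_targets[pos], reduction="sum")
    n_box = labels.numel() // 9  # anchor locations = total anchors / k
    return cls + lam * box / n_box
```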
First, the RPN is trained independently: it is initialized with ImageNet-pretrained weights and fine-tuned end-to-end for the region proposal task. In the second step, the Fast R-CNN detector network, whose layers are also initialized with ImageNet-pretrained weights, is trained and fine-tuned end-to-end using the proposals generated by the trained RPN; at this point the two networks do not yet share convolutional layers. In the third step, the trained detector network is used to initialize the RPN, and only the RPN-specific layers are fine-tuned while the other layers' weights are frozen. From here on, the convolutional layers are shared between both networks. In the final step, we again fine-tune only the layers unique to Fast R-CNN. Now we have a unified model with both networks sharing the convolutional layers. "This procedure can be repeated, but there was no significant improvement," said the authors.
Faster R-CNN achieves an mAP of 69.9% on the PASCAL VOC 2007 test set, and up to ~79% when trained on the combination of PASCAL VOC 2007, VOC 2012, and COCO. The inference time drops to ~0.2s per image with the RPN, compared to ~2s for the Selective-Search-based Fast R-CNN. Thus, we can conclude that the RPN improved the mAP modestly and greatly sped up the process. Many experiments were conducted on the number of proposals and on the combinations of datasets used for training; I suggest reading the full Faster R-CNN paper for all the intricate details, as there are a few interesting observations too. In December 2015, a Faster R-CNN with a ResNet-101 backbone won the COCO Object Detection Competition, and it is considered one of the State-of-the-Art networks for Object Detection to date! Faster R-CNN was extended to pixel-level instance segmentation in 2017 as the popular Mask R-CNN, which is used in many real-world applications. We will discuss Mask R-CNN as we continue with the R-CNN family of Object Detectors in our next blog post. Until then, try implementing Faster R-CNN (a minimal pretrained starting point is sketched below) and put down your thoughts and observations in the comments.
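For a quick starting point, torchvision ships a pretrained Faster R-CNN (with a ResNet-50 FPN backbone rather than the paper's VGG-16); a minimal inference sketch:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained on COCO; the weights argument may vary across torchvision
# versions (older releases use pretrained=True instead).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A dummy 3-channel image; replace with a real image tensor scaled to [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]

# Each prediction dict has 'boxes' (x1, y1, x2, y2), 'labels', and 'scores'.
print(predictions["boxes"].shape, predictions["scores"][:5])
```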
Author
Pranav Raikote