In the previous post, we had an in-depth look at the Region-based Convolutional Neural Network (R-CNN), one of the foundational architectures in the two-stage Object Detection pipeline. With the onset of the Deep Learning era around 2012, when AlexNet was published, approaches to Object Detection shifted from hand-crafted features such as Haar features and Histograms of Oriented Gradients to Neural Network-based methods, chiefly CNN-based architectures. Since then, the problem has largely been tackled via a 2-stage approach: the first stage generates Region Proposals, and the second stage classifies each proposed region.
In our previous article on Region-based CNNs, we covered the essential building blocks of this breakthrough architecture. If you have studied the R-CNN paper, you will find some interesting observations. Before fine-tuning, adding more Fully Connected Layers made no difference to accuracy, which meant that the Convolutional Layers contributed most of the accuracy and the FC Layers added very little value. After fine-tuning, on the other hand, the majority of the weight updates happened within the fully connected layers, and accuracy increased. The conclusion was that the Convolutional Layers capture generalizable features while the FC Layers capture task-specific ones. We can therefore experiment with the FC Layers to find a trade-off between accuracy, model size, and inference time. The authors also attributed their use of an SVM over the fine-tuned CNN for detection to 2 main factors: the positive examples used for fine-tuning did not emphasize precise localization, and the negative examples were easy/soft negatives rather than hard negatives. Soft negatives are regions of empty, plain background, while hard negatives may contain partial objects and considerable noise, which makes them easy to misclassify.
R-CNN had its own set of drawbacks, mainly associated with its very slow inference. The reasons are not one but three! First, Selective Search outputs ~2000 region proposals for each image. Second, the CNN extracts features for every proposal separately (N × 2000 forward passes, where N is the number of images). Third, it is a fancy, complex multi-stage training pipeline of three separate models working sequentially without any shared computation: CNN, SVM, and Bounding Box Regressor. Due to the above reasons, R-CNN takes ~45s to run inference on a single image, even on a GPU! That was 9 times slower than the previous best-performing model, OverFeat. These are the reasons why R-CNN wasn't deployable in real-world or real-time scenarios.
What happened after R-CNN? Was there any improvement on R-CNN? Could researchers come up with a better solution to this problem? Yes, they did. They tweaked R-CNN with a few subtle modifications and behold – Fast R-CNN came into existence! It was accurate and faster than R-CNN by a huge margin, with an inference time of just ~2s. That's an enormous jump from ~45s. Let's understand the improvements of Fast R-CNN over R-CNN.
The authors of R-CNN worked on improving some of its shortcomings and found a better way to feed the Region Proposals, saving a lot of training and inference time. One of the major bottlenecks was generating features for 2000 Region Proposals per image, which added up to a huge number of forward passes; training the R-CNN architecture took ~84 hours. The authors found a way to cut this computational overhead: how about we generate a single feature map per image and project the Regions of Interest (RoIs) onto the feature map itself? This avoids the tedious and computationally intensive task of generating a feature map for each of the generated region proposals (a quick code sketch of this idea follows the figure). So, the modified architecture looks like the image shown below.

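To make the single-forward-pass idea concrete, here is a minimal PyTorch sketch. The image size, proposal coordinates, and use of torchvision's VGG-16 are my own illustrative assumptions, not code from the paper:

```python
import torch
import torchvision

# Hypothetical toy input: one 800x800 image and a single Selective Search
# proposal in pixel coordinates (x1, y1, x2, y2).
image = torch.randn(1, 3, 800, 800)
proposal = torch.tensor([128.0, 64.0, 416.0, 352.0])

# One forward pass through a truncated backbone: VGG-16's conv layers with
# the final max-pool dropped, for a total stride of 16.
backbone = torchvision.models.vgg16(weights=None).features[:-1]
with torch.no_grad():
    feature_map = backbone(image)   # shape: (1, 512, 50, 50)

# Projecting a proposal onto the feature map is just a rescaling of its
# coordinates by the stride; no extra CNN pass is needed per proposal.
roi_on_feature_map = proposal / 16  # tensor([ 8.,  4., 26., 22.])
```

However many proposals Selective Search produces, the expensive convolutional work above happens exactly once per image.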
Each image is fed to the CNN (VGG-16, with its final MaxPooling and FC Layers removed), which outputs a single feature map for the whole image. Selective Search generates the ~2000 region proposals, which are projected onto this CNN-generated feature map. RoI Pooling then converts the features inside each projected region, whatever its size, into a fixed window size, ensuring the output dimension is always constant. Note that a form of warping still takes place here, to an extent. Since the RoI Pooling layer always outputs fixed-length feature vectors, all further processing happens on these vectors. The model then branches into 2 output layers – an Object Classification layer and a Bounding Box Regression layer. The softmax layer for classification covers K+1 classes, where the +1 is a background class, and outputs a discrete probability distribution for each RoI. The Bounding Box Regressor predicts offsets to the original RoI for each of the K classes.
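RoI Pooling is available as a ready-made op in torchvision, so we can sketch this step directly. The feature map and RoI coordinates below are illustrative; the 7×7 output size matches what the paper uses for VGG-16:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 50, 50)  # the single per-image feature map

# RoIs in image coordinates, each row prefixed with its batch index:
# (batch_idx, x1, y1, x2, y2).
rois = torch.tensor([[0.0, 128.0,  64.0, 416.0, 352.0],
                     [0.0, 300.0, 200.0, 700.0, 600.0]])

# spatial_scale = 1/16 maps image coordinates onto the stride-16 feature map;
# output_size pools every RoI into a 7x7 grid regardless of its original size.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 512, 7, 7]) -- fixed length for every RoI
```

Because every RoI comes out as the same 512 × 7 × 7 block, the two output heads can be ordinary fixed-size layers.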
One thing to note here is that training is a joint learning procedure: fine-tuning the CNN, classification, and bounding box regression all happen together. The loss used for the localization task is the smooth L1 loss, and the final loss is a combination of the classification and localization losses. Hence, the network is back-propagated with a single loss, which solves the complex multi-stage, non-shareable computation problem. For an in-depth understanding of the loss functions and the various training parameters, I suggest a detailed read of the Fast R-CNN paper.
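In the paper's notation the multi-task loss is L(p, u, t, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t, v), where the indicator [u ≥ 1] switches the regression term off for background RoIs. Here is a minimal sketch with hypothetical head outputs; note the real bounding-box head predicts K sets of 4 offsets and the set for the true class is selected, which I pre-select here for brevity:

```python
import torch
import torch.nn.functional as F

K = 20                                    # e.g. the 20 PASCAL VOC classes
cls_scores = torch.randn(8, K + 1)        # per-RoI scores over K+1 classes
bbox_preds = torch.randn(8, 4)            # offsets predicted for the true class
labels = torch.randint(0, K + 1, (8,))    # ground-truth class per RoI, 0 = background
bbox_targets = torch.randn(8, 4)          # ground-truth regression targets

# Classification term: log loss over the K+1 softmax outputs.
loss_cls = F.cross_entropy(cls_scores, labels)

# Localization term: smooth L1, applied only to foreground RoIs (u >= 1);
# background RoIs contribute no regression signal.
fg = labels > 0
loss_loc = F.smooth_l1_loss(bbox_preds[fg], bbox_targets[fg])

lam = 1.0                                 # the balancing weight; the paper uses 1
loss = loss_cls + lam * loss_loc          # one loss, back-propagated end-to-end
```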

Fast R-CNN lived up to its name: it reduced training time to around 9.5 hours and is up to 45 times faster at test time, which is a huge improvement. There wasn't much of an improvement in terms of mAP, though, with accuracy standing near 69%.
But is there still scope for improvement? Yes, there is! Remember that Selective Search needs to generate up to 2000 Region Proposals? That's a bottleneck, and the network is still not fully unified. Well, this bottleneck too was solved in the next iteration of the R-CNN family: the Faster R-CNN model. There was one more architecture that came a wee bit before even Fast R-CNN – SPPnet – which we will discuss in our next post before continuing with the Faster R-CNN architecture in subsequent posts. Until then, keep learning, and do share your thoughts on this post.
Author
Pranav Raikote
References
- R-CNN paper: https://arxiv.org/pdf/1311.2524.pdf
- Fast R-CNN paper: https://arxiv.org/pdf/1504.08083.pdf
- CS231n 2017 Lecture 11, slide 71 onwards (Fast R-CNN): http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
- Fast R-CNN: https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection#fast-rcnn