In our last post, we took a quick look at Object Detection and the main approaches used to tackle it in Computer Vision. Now it's time to dive deep into the popular methods for building a state-of-the-art Object Detector. In particular, we shall focus on one of the earliest families: the Region-Based Convolutional Neural Network (R-CNN) family of Object Detectors. The name reflects the design: region proposals are passed through a CNN, and the CNN architecture is restructured or extended with auxiliary networks and layers to turn it into an Object Detector. R-CNN does not achieve state-of-the-art performance today, but it is the best place to start in the Object Detection space.

### Introduction to R-CNN

R-CNN, short for Region-based Convolutional Neural Networks, was first introduced in 2014 and has over 15000 citations today. It was one of the fundamental breakthroughs in Object Detection and performed far better than any other method at the time. It chains several stages together into a single pipeline. Let's look at the overall architecture first and then go through the different parts of R-CNN in detail. Given below is the overall high-level architecture of R-CNN, whose parts are: generating Region Proposals, extracting features with a pre-trained network, a Linear SVM for identifying the class, and a Bounding Box Regressor for localization.

The first step of the pipeline is extracting Region Proposals. Various techniques are available for this task, such as Sliding Windows, Colour Contrast, Edge Boxes, Super Pixel Straddling and Selective Search. Extracting Region Proposals means sampling cropped regions of arbitrary size from the image, each of which may or may not contain an object. R-CNN uses the Selective Search algorithm, as it was found to be more effective; it outputs up to 2000 category-independent regions per image. Refer to the Selective Search paper (listed in the References) to learn about the algorithm in depth.

Selective Search is class-agnostic: it is often used as a preprocessor to produce a set of candidate bounding boxes that have a high chance of containing some object, without saying which one. Because of this, we need a separate classifier at the end to determine the actual class of the object inside each output bounding box. One important preprocessing step is warping each proposed region to the fixed, predefined input size of the CNN, which the network requires. The images below give a glimpse of Selective Search and the proposal boxes it generates.
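The crop-and-warp step can be sketched in a few lines. This is a minimal illustration and not the authors' code: the image is a plain nested list of pixel values, `warp_region` is a hypothetical helper, nearest-neighbour sampling stands in for whatever interpolation a real pipeline would use, and the 227 x 227 default is the AlexNet input size assumed here.

```python
def warp_region(image, box, out_h=227, out_w=227):
    """Crop box = (x1, y1, x2, y2) from image and stretch it to out_h x out_w.

    image is a list of rows, each row a list of pixel values; x2 and y2 are
    exclusive. Nearest-neighbour sampling is used and the aspect ratio is
    NOT preserved, which is the distortion the warping step introduces.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    if crop_h <= 0 or crop_w <= 0:
        raise ValueError("box must have positive width and height")
    warped = []
    for i in range(out_h):
        # Map output row i back to a source row inside the crop.
        row = image[y1 + (i * crop_h) // out_h]
        warped.append([row[x1 + (j * crop_w) // out_w] for j in range(out_w)])
    return warped


# Toy 4x6 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]

# A 4-wide, 2-tall proposal stretched to a square 4x4 patch.
patch = warp_region(image, box=(1, 0, 5, 2), out_h=4, out_w=4)
for row in patch:
    print(row)
```

In the toy call, each source row is repeated twice because a 2-pixel-tall crop is stretched to 4 rows, which is the kind of stretching a non-square proposal goes through.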

Next up is the Feature Extraction phase, where the authors used AlexNet (popular at the time) as a pre-trained network to generate a 4096-dimensional feature vector for each of the 2000 region proposals. We take the pre-trained AlexNet, remove the last softmax layer so that the network outputs feature vectors, and then fine-tune the CNN on our warped (distorted) region images and the specific target classes. For fine-tuning, each proposal is labeled with the class of the ground-truth box it overlaps most, measured by IoU (Intersection over Union); the paper treats a proposal as positive when this IoU is at least 0.5, and all remaining proposals are negatives (background) for every class. The output of this phase is therefore one 4096-dimensional feature vector per proposal.
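To make the IoU-based labeling concrete, here is a minimal sketch. The helper names `iou` and `label_proposal`, the `(x1, y1, x2, y2)` box format and the toy boxes are assumptions for illustration; the 0.5 threshold is the fine-tuning threshold reported in the R-CNN paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposal(proposal, ground_truths, threshold=0.5):
    """Return the class of the best-overlapping ground truth, or "background".

    ground_truths is a list of (box, class_name) pairs.
    """
    best_iou, best_class = 0.0, "background"
    for gt_box, gt_class in ground_truths:
        overlap = iou(proposal, gt_box)
        if overlap > best_iou:
            best_iou, best_class = overlap, gt_class
    return best_class if best_iou >= threshold else "background"


ground_truths = [((10, 10, 50, 50), "cat"), ((60, 60, 100, 100), "dog")]
print(label_proposal((12, 12, 52, 52), ground_truths))  # cat
print(label_proposal((40, 40, 70, 70), ground_truths))  # background
```

The second proposal touches both objects but overlaps neither by much, so it ends up as a negative example.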

The feature vectors are then used to train Linear SVMs that determine the class of the object. We need an individual SVM for each object class we are training for. Each feature vector therefore yields n outputs, where n is the total number of classes under consideration and each output is a confidence score for that class. The highest confidence score gives the predicted class for a region, and from these predictions we can infer the Object Class(es) present in a particular image. Given below is the graphical representation of the Feature Vector and SVM computation matrices.
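The per-class scoring amounts to one dot product per (proposal, class) pair. The sketch below uses toy 3-dimensional features instead of 4096-dimensional ones; the function names and all weights, biases and feature values are made up for illustration, not trained.

```python
def svm_scores(features, weights, biases):
    """Score every proposal against every per-class linear SVM.

    features: list of feature vectors, one per proposal (4096-d in R-CNN).
    weights:  dict mapping class name -> weight vector of the same length.
    biases:   dict mapping class name -> scalar bias.
    Returns one {class name: score} dict per proposal.
    """
    return [
        {cls: sum(w_i * f_i for w_i, f_i in zip(w, feat)) + biases[cls]
         for cls, w in weights.items()}
        for feat in features
    ]


def best_class(scores):
    """Pick the highest-scoring class for one proposal."""
    return max(scores, key=scores.get)


# Toy example: two proposals, two classes, hand-picked (untrained) weights.
features = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.5]]
weights = {"cat": [0.5, -1.0, 1.0], "dog": [-0.5, 1.0, 0.0]}
biases = {"cat": -0.5, "dog": 0.0}

all_scores = svm_scores(features, weights, biases)
predictions = [best_class(s) for s in all_scores]
print(predictions)  # ['cat', 'dog']
```

With 2000 proposals and n classes this is a 2000 x 4096 feature matrix multiplied by a 4096 x n weight matrix. In a real detector, a proposal whose scores are all low would typically be dropped rather than assigned its least-bad class.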

The final stage is the Localization aspect of Object Detection. A regression model with an L1/L2 loss function is attached to predict the bounding box coordinates. This Bounding Box Regression is optional and was added later to the original R-CNN implementation to increase localization accuracy. It was tried out at a later stage because the Region Proposals are already bounding boxes of a kind. Training this stage also requires the ground-truth bounding box coordinates as input.

The low accuracy observed initially, about 45%, was attributed to the warped images, which appear distorted and stretched to a network pre-trained on ordinary images. To counter this, the authors fine-tuned the network, replacing the original classification layer with an (N+1)-way softmax output layer (N object classes plus background). This increased the accuracy by roughly 8 percentage points, to about 54%.

One more problem encountered here is that the model might predict multiple bounding boxes for a single object, say around 5 boxes when there is only one object in the image. A greedy approach, known as non-maximum suppression, overcomes these overlaps: the boxes are sorted by confidence score, the highest-scoring box is selected, every remaining box whose IoU with it is too high is discarded, and the process repeats. This leaves a single best bounding box per object.
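The greedy selection described above (commonly called non-maximum suppression) can be sketched as follows. The function names, the `(x1, y1, x2, y2)` box format, the toy boxes and the 0.5 IoU threshold are illustrative assumptions rather than values taken from the post.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the boxes to keep, best score first."""
    # Sort box indices by descending confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the chosen one too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Three near-duplicate detections of one object plus one separate object.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (100, 100, 140, 140)]
scores = [0.90, 0.75, 0.80, 0.60]

kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 3]
```

In a multi-class detector this would usually be run once per class, so that a high-scoring "cat" box does not suppress an overlapping "dog" box.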

It was then found experimentally that a Bounding Box Regressor helps move the predicted bounding boxes closer to the ground-truth coordinates. This gave a further gain of a few mAP points (about 3 to 4 in the paper's PASCAL VOC results), and when the VGG network was later used in place of AlexNet, the reported accuracy was close to 66%. R-CNN achieved a mAP of 54% on PASCAL VOC 2010 and 31% on the ImageNet detection dataset.
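What the regressor learns can be illustrated with the centre-and-size offset parametrisation described in the R-CNN paper: shifts in x and y scaled by the proposal size, plus log-scale changes in width and height. The sketch below only computes and applies those offsets; the helper names and toy boxes are hypothetical, and the regression model that would predict the offsets from CNN features is not shown.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map the proposal onto the ground truth."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_offsets(proposal, offsets):
    """Refine a proposal with (predicted) offsets; inverse of the above."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = offsets
    cx, cy = px + tx * pw, py + ty * ph
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal = (10, 10, 50, 50)
ground_truth = (14, 8, 58, 52)

targets = regression_targets(proposal, ground_truth)
refined = apply_offsets(proposal, targets)
print(targets)
print(refined)
```

During training the targets are the regression labels; at test time the model's predicted offsets are passed to `apply_offsets`. A perfect prediction recovers the ground-truth box, as the round trip above shows up to floating-point error.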

### Conclusion

And finally, we are through! We covered the R-CNN architecture in detail and looked at its various stages and the techniques employed to solve the problems faced during the development of this model. The R-CNN paper is linked in the References. I'd recommend reading the full paper for an exhaustive, in-depth understanding of R-CNN and of its various experiments and observations, which are really interesting!

In the next post, we will revisit R-CNN's drawbacks and understand how they were overcome, which gave rise to faster Object Detection architectures. Until then, share your thoughts on this post and think about why R-CNN had major drawbacks and wasn't adequate for real-world deployment.

### Author

Pranav Raikote

### References

- R-CNN Paper : https://arxiv.org/abs/1311.2524
- Selective Search Paper : http://www.huppelen.nl/publications/selectiveSearchDraft.pdf
- Slides, Other Computer Vision Tasks (Slides 17 and 53) : http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
- AlexNet Paper : https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- VGG-16 Paper : https://arxiv.org/abs/1409.1556
- R-CNN : https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection#rcnn