The performance of deep learning models has improved significantly on several computer vision tasks, yet supervised learning models still rely on a large number of labeled images. High-quality annotations are expensive to obtain, and this motivates research in other directions, including Active Learning.
In our last blog post, we went through the Faster R-CNN architecture for Object Detection, which remains one of the State-of-the-Art architectures to date! Faster R-CNN has a low inference time of just ~0.2s per image (5 fps), a huge improvement over the ~45-50s per image of the original R-CNN. So far, we have seen how R-CNN evolved into Fast R-CNN and Faster R-CNN by simplifying the architecture, reducing training and inference times, and increasing the mAP (Mean Average Precision). This article takes a step further, from Object Detection to Instance Segmentation. Instance Segmentation identifies the boundaries of each detected object at the pixel level. It goes one step beyond Semantic Segmentation, which groups all pixels of the same class under a common mask to separate them from other classes. Instance Segmentation instead labels each object of the same class as a separate instance.
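To make the difference between the two output formats concrete, here is a minimal sketch in plain Python. The toy mask and the `label_instances` helper are hypothetical, and the flood-fill labeling is only a stand-in for illustration: it is not how an instance segmentation network works, and it would merge two objects that touch. It simply shows that a semantic mask gives every object of a class the same value, while an instance map gives each object its own id.

```python
from collections import deque

# Toy semantic mask (hypothetical data): 1 = "object" class, 0 = background.
# Both objects share the same value, so they cannot be told apart.
SEMANTIC_MASK = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
]


def label_instances(mask):
    """Give each 4-connected foreground blob its own positive id.

    Returns (labels, count), where labels has the same shape as mask.
    """
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have an id.
            if mask[r][c] == 0 or labels[r][c] != 0:
                continue
            next_id += 1
            labels[r][c] = next_id
            queue = deque([(r, c)])
            # Breadth-first flood fill over the up/down/left/right neighbours.
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = next_id
                        queue.append((ny, nx))
    return labels, next_id


instance_labels, num_instances = label_instances(SEMANTIC_MASK)
for row in instance_labels:
    print(row)
print("instances found:", num_instances)
```

In the instance map, the left object is filled with 1s and the right object with 2s, even though both belong to the same class in the semantic mask.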
In our previous articles, we looked at a few limitations of R-CNN and how SPP-net & Fast R-CNN solved them to a great extent, bringing inference time down to ~2s per test image, an enormous improvement over the ~45-50s of R-CNN. But even after such a speedup, there are still flaws to fix and enhancements to make before it can be deployed on a real-time 30fps or 60fps video feed. As we know from our previous blog, Fast R-CNN & SPP-net still depend on the Selective Search Algorithm for generating the regions, and SPP-net is additionally trained in multiple stages. This is often a huge bottleneck for the entire system because Selective Search takes a long time to generate ~2000 region proposals. This problem was solved in Faster R-CNN - the widely used State-of-the-Art version in the R-CNN family of Object Detectors. We’ve seen the evolution of architectures in the R-CNN family, where the main improvements were in computational efficiency, accuracy, and test time per image. Let's dive into Faster R-CNN now!
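To put those numbers in perspective, the short sketch below converts the per-image latencies quoted above into frames per second and into the speedup a real-time feed would require. The function names are hypothetical helpers, and the latencies are the approximate figures from the text (taking ~45s for R-CNN), not new measurements.

```python
def frames_per_second(seconds_per_image):
    """Throughput when images are processed one at a time."""
    return 1.0 / seconds_per_image


def speedup_needed(seconds_per_image, target_fps):
    """How many times faster the detector must be to hit target_fps."""
    return seconds_per_image * target_fps


# Approximate per-image latencies quoted in the text.
RCNN_SECONDS = 45.0
FAST_RCNN_SECONDS = 2.0

for name, latency in (("R-CNN", RCNN_SECONDS), ("Fast R-CNN", FAST_RCNN_SECONDS)):
    print(
        f"{name}: {frames_per_second(latency):.3f} fps, "
        f"needs {speedup_needed(latency, 30):.0f}x speedup for 30fps, "
        f"{speedup_needed(latency, 60):.0f}x for 60fps"
    )
```

At ~2s per image, Fast R-CNN runs at about 0.5 fps, so it would still need to be roughly 60 times faster to keep up with a 30fps feed.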
Alongside R-CNN and Fast R-CNN, which we covered in our recent blog posts, there is one more famous architecture for Image Classification, Object Detection & Localization. It was the first runner-up in Object Detection, the second runner-up in Image Classification, and placed 5th in the Localization task at ILSVRC 2014! This feat makes it one of the major architectures to study in Object Detection and Image Classification. The architecture is SPPnet - Spatial Pyramid Pooling network. In this article, we shall delve into SPPnet purely from an Object Detection perspective.
Object Detection is one of the most sought-after sub-disciplines of Computer Vision. Its extensive use in major real-world applications has made it extremely important. As humans, we have an innate cognitive intelligence, trained daily, to recognize and understand what we see through our eyes. Object Detection is one of the advanced methods by which a computer tries to match that ability to perceive and understand its surroundings, with Image Classification and Localization as the primary steps. Each object has its own set of varying characteristics, which makes the task challenging for a Deep Learning Model/Architecture. It is a different ball game altogether to build an efficient and accurate Object Detector. Let's take a quick tour of the extensions and key concepts under Computer Vision before diving deep into Object Detection.