Alongside R-CNN and Fast R-CNN, which we covered in our recent blog posts, there is one more famous architecture for Image Classification, Object Detection, and Localization. It was the first runner-up in Object Detection, the second runner-up in Image Classification, and placed fifth in the Localization task at ILSVRC 2014! This feat makes it one of the major architectures to study in Object Detection and Image Classification. The architecture is SPPnet, the Spatial Pyramid Pooling network. In this article, we shall delve into SPPnet from an Object Detection perspective only.
In the previous post, we had an in-depth overview of Region-based Convolutional Neural Networks (R-CNN), one of the fundamental architectures in the two-stage Object Detection pipeline. As the Deep Learning era ramped up around 2012, when AlexNet was published, the approach to Object Detection shifted from hand-built features such as Haar features and Histograms of Oriented Gradients to Neural Network-based methods, mainly CNN-based architectures. Over time, the problem has come to be solved via a two-stage approach: the first stage generates Region Proposals, and the second stage classifies each proposed region.
Object Detection is one of the most sought-after sub-disciplines of Computer Vision. Its extensive use in major real-world applications has made it extremely important. When humans perceive, we rely on an innate cognitive intelligence, trained daily, to recognize and understand what we see through our eyes. Object Detection is one of the advanced methods by which a computer tries to match that power to perceive and understand its surroundings, the primary steps being Image Classification and Localization. Each object has its own set of varying characteristics, which makes detection challenging for a Deep Learning model or architecture. Building an efficient and accurate Object Detector is a different ball game altogether. Let's take a short tour of the extensions and key concepts under Computer Vision before diving deep into Object Detection.
I recently came across the paper "CLEVRER" ("CoLlision Events for Video REpresentation and Reasoning", by Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum). It intrigued me, so I wanted to share some thoughts about it. With the advancements in NN-based learning algorithms, many of us are wondering …