Object Detection is one of the most sought-after subfields of Computer Vision, and its extensive use in major real-world applications makes it extremely important. Humans have an innate cognitive ability, trained every day, to recognize and understand what we see through our eyes. Object detection is one of the advanced methods by which a computer tries to match that power to perceive and understand its surroundings, with Image Classification and Localization as the primary stepping stones. Every object has its own set of varying characteristics that challenge a deep learning model, and building an efficient, accurate object detector is a different ball game altogether. Let's take a short tour of the key concepts under Computer Vision before diving deep into Object Detection.
First comes Image Classification, where the task is to assign a class label to a given image. The next level is Object Localization, where the goal is to detect the presence of an object and draw a bounding box depicting its location. A bounding box is a rectangle drawn on the image that marks where the detected object sits. In the image shown above, multiple labeled boxes pick out a dog, a person, traffic lights, and so on. Building on top of this is Object Detection, where we also identify the class label of each bounding box along with a confidence score. Object Detection can be thought of as having two levels: single-object detection, where the detector looks for the one object it was trained on, and multi-object detection, where it is trained to detect a multitude of objects in a single image. The highest level is segmentation, where we mark the pixels belonging to every object rather than drawing a bounding box; instance segmentation in particular can label even overlapping objects quite accurately.
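To make the idea of a bounding box concrete, here is a minimal sketch of how a single detection (box, label, confidence) might be represented; the `Detection` class and its field names are illustrative, not tied to any particular library:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detector output: a box, a class label, and a confidence score.

    Names are illustrative; real frameworks use their own box formats.
    """
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str
    score: float  # confidence in [0, 1]

    def area(self) -> float:
        # Guard against degenerate boxes with non-positive extent.
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)

dog = Detection(40, 60, 220, 300, "dog", 0.92)
print(dog.area())  # 180 * 240 = 43200
```

Corner-coordinate boxes like this (`x_min, y_min, x_max, y_max`) are one common convention; center-plus-size (`x, y, w, h`) is the other, and YOLO-style models use the latter.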
There are two main families of object detection models: multi-stage detectors and single-stage detectors. The Region-based Convolutional Neural Network (R-CNN) family is multi-stage, since it splits detection into two sub-stages: proposing candidate regions, then classifying them. The SSD (Single Shot Detector) and YOLO (You Only Look Once) families are single-stage detectors, classifying objects and regressing bounding boxes for the whole image in a single network pass. Let's briefly inspect these popular architectures.
The R-CNN consists of three modules: Region Proposal, Feature Extractor, and Classifier. The region proposal module generates approximate bounding boxes (region proposals) using the selective search algorithm, roughly 2,000 per image. The features of those region proposals are then extracted using a deep convolutional neural network. Finally, the features are classified with per-class linear SVMs. This pipeline takes on the order of 49 seconds per image, which is far too slow for real-world deployment. There are faster and more efficient successors, namely Fast R-CNN and Faster R-CNN, which cut detection time from roughly 49 seconds in the original R-CNN to about 0.2 seconds in Faster R-CNN.
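The three R-CNN stages can be sketched as a pipeline of plain functions. Everything below is a toy stand-in: the real system uses selective search for proposals, a deep CNN for features, and linear SVMs for classification, while here a sliding window, a mean-intensity "feature," and a threshold play those roles just to show how the stages compose:

```python
import numpy as np

def propose_regions(image, step=32, size=64):
    """Stand-in for selective search: a coarse sliding-window proposal set."""
    h, w = image.shape[:2]
    boxes = []
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            boxes.append((x, y, x + size, y + size))
    return boxes

def extract_features(image, box):
    """Stand-in for the CNN feature extractor: mean intensity of the crop."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2].mean()

def classify(feature, threshold=0.5):
    """Stand-in for the per-class linear SVMs: threshold on the toy feature."""
    return "object" if feature > threshold else "background"

image = np.zeros((128, 128))
image[32:96, 32:96] = 1.0  # a bright square acting as our "object"

# Stage 1 -> Stage 2 -> Stage 3, one pass per proposal (R-CNN's main cost).
detections = [(box, classify(extract_features(image, box)))
              for box in propose_regions(image)]
hits = [box for box, label in detections if label == "object"]
print(len(detections), hits)  # 9 proposals; only (32, 32, 96, 96) survives
```

Note that the feature extractor runs once per proposal; with ~2,000 selective-search proposals and a deep CNN instead of a mean, that repetition is exactly why the original R-CNN is so slow, and why Fast/Faster R-CNN share computation across proposals.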
The Single Shot Detector comprises just two modules: the first extracts feature maps, and the second applies convolutional filters to those maps to detect objects. (Convolutional filters are the key building blocks of any CNN; they pick up contours in the image and output corresponding feature maps.) The SSD applies 3×3 convolutional filters at every cell of a feature map to generate predictions. In the PASCAL VOC setting, each prediction carries 25 channels: 21 class scores (20 object classes plus background) and 4 bounding-box offsets. A typical SSD uses a modified VGG16 as the base convolutional network, with six auxiliary convolutional layers stacked on top of the truncated VGG16. Drawing predictions from feature maps at several resolutions, the network makes roughly 8,700 predictions per image. We also get a considerably higher speed of around 22-49 FPS, which is quite suitable for real-time deployment. Given below is a pictorial representation of the SSD architecture.
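That prediction count can be reproduced with a little arithmetic. Using the six feature-map resolutions and default-box counts reported in the SSD paper for the 300×300 variant (a quick sanity check, not detector code):

```python
# SSD300 feature maps as (grid size, default boxes per cell), per the SSD paper.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Each cell of each map predicts one output per default box.
total_boxes = sum(grid * grid * boxes for grid, boxes in feature_maps)

# Each prediction carries class scores plus box offsets (PASCAL VOC setting).
channels = 21 + 4  # 20 classes + background, plus 4 bounding-box offsets

print(total_boxes, channels)  # 8732 25
```

The exact figure is 8,732 predictions per image, which matches the "roughly 8,700" quoted above; most of them come from the large 38×38 and 19×19 maps.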
The YOLO approach uses a single neural network trained end to end: it takes an image as input and outputs bounding boxes together with a class label for each box. Its accuracy trails the two-stage detectors, but it delivers a throughput of roughly 45-155 FPS depending on the variant. The input image is split into a grid of cells, and a cell is responsible for predicting an object whenever the center of that object's bounding box falls within it. Each cell predicts a fixed number of bounding boxes, each with x, y coordinates, width, height, and a confidence score, along with per-class probabilities. Newer versions, namely YOLOv2, YOLOv3, and YOLOv4, are both faster and more accurate.
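The size of YOLO's output tensor follows directly from this grid scheme. With the original YOLOv1 hyperparameters (a 7×7 grid, 2 boxes per cell, 20 PASCAL VOC classes), the arithmetic works out as:

```python
# YOLOv1 hyperparameters from the paper: S×S grid, B boxes/cell, C classes.
S, B, C = 7, 2, 20

# Each box contributes 5 numbers (x, y, w, h, confidence);
# class probabilities are predicted once per cell, not per box.
per_cell = B * 5 + C
output_size = S * S * per_cell

print(per_cell, output_size)  # 30 values per cell, 1470 in total
```

So the whole detector reduces to regressing a single 7×7×30 tensor in one forward pass, which is where the speed advantage over region-proposal pipelines comes from.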
And voila! That completes our quick tour of Object Detection and the main approaches to the problem. Sit tight and look forward to the following posts for an in-depth review of each of the algorithms listed above. Until then, keep learning, and do share your thoughts on this article.
- R-CNN paper: https://arxiv.org/pdf/1311.2524.pdf
- Fast R-CNN paper: https://arxiv.org/pdf/1504.08083.pdf
- Faster R-CNN paper: https://arxiv.org/pdf/1506.01497.pdf
- SSD paper: https://arxiv.org/pdf/1512.02325.pdf
- YOLO paper: https://arxiv.org/pdf/1506.02640v5.pdf
- Object Localization & Detection: https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection
- Object Detection: https://livebook.manning.com/book/deep-learning-for-vision-systems/chapter-7/v-8