In this short article, we will look at the state of egocentric videoconferencing. This doesn’t mean that only we get to speak during a meeting; it means that we are wearing a camera, and what it captures looks like the Input (in the video below). The goal is to use a learning algorithm to synthesize a frontal view of us. The recorded reference footage, which is the reality, is shown as Ground Truth, and the algorithm's attempt to synthesize that footage is shown as Predicted. If we could pull that off, we could add a low-cost egocentric camera to smart glasses and it could pretend to see us from the front, which would be amazing for hands-free videoconferencing.
In our previous articles, we looked at a few limitations of R-CNN and saw how SPP-net and Fast R-CNN resolved them to a great extent, cutting inference time to ~2s per test image, a large improvement over the ~45-50s of R-CNN. Even after such a speedup, there are still flaws to fix and enhancements to make before the detector can be deployed on a real-time 30fps or 60fps video feed. As we know from our previous blog, Fast R-CNN and SPP-net are still multi-stage pipelines that depend on the Selective Search algorithm for generating regions. Selective Search is a huge bottleneck for the entire system because it takes a long time to generate ~2000 region proposals. This problem was solved in Faster R-CNN - the widely used State-of-the-Art member of the R-CNN family of Object Detectors. Across the R-CNN family, the main improvements from one architecture to the next have been in computational efficiency, accuracy, and test time per image. Let's dive into Faster R-CNN now!
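To put those timings in perspective, here is a rough back-of-the-envelope calculation as a small Python sketch. The per-image times are the approximate figures quoted above (taking 45s as the lower end of the R-CNN range); the function names are ours and purely illustrative.

```python
# Approximate per-image inference times quoted in the text (seconds).
RCNN_SECONDS = 45.0        # lower end of the ~45-50s range (assumption)
FAST_RCNN_SECONDS = 2.0    # ~2s per test image


def frame_budget_seconds(fps: float) -> float:
    """Time available to process a single frame at the given frame rate."""
    return 1.0 / fps


def slowdown_vs_realtime(seconds_per_image: float, fps: float) -> float:
    """How many times slower than real time a detector is at this frame rate."""
    return seconds_per_image / frame_budget_seconds(fps)


if __name__ == "__main__":
    speedup = RCNN_SECONDS / FAST_RCNN_SECONDS
    print(f"Fast R-CNN vs R-CNN speedup: ~{speedup:.1f}x")
    for fps in (30, 60):
        budget_ms = frame_budget_seconds(fps) * 1000
        factor = slowdown_vs_realtime(FAST_RCNN_SECONDS, fps)
        print(f"{fps}fps: budget {budget_ms:.1f} ms/frame, "
              f"~2s per image is ~{factor:.0f}x too slow")
```

At 30fps each frame has a budget of roughly 33 ms, so a detector that needs ~2s per image is about 60 times too slow for a live feed, which is why the region proposal step had to be rethought.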
Alongside the R-CNN and Fast R-CNN architectures covered in our recent blog posts, there is one more famous architecture for Image Classification, Object Detection, and Localization. It was the first runner-up in Object Detection, second runner-up in Image Classification, and placed fifth in the Localization task at ILSVRC 2014! This feat makes it one of the major architectures to study in Object Detection and Image Classification. The architecture is SPPnet - the Spatial Pyramid Pooling network. In this article, we shall delve into SPPnet purely from an Object Detection perspective.
In the previous post, we had an in-depth overview of Region-based Convolutional Neural Networks (R-CNN), one of the fundamental architectures in the Two-Stage Object Detection pipeline. As the Deep Learning era ramped up around 2012, when AlexNet was published, the approach to Object Detection shifted from hand-built features such as Haar features and Histograms of Oriented Gradients to Neural Network-based approaches, mainly CNN-based architectures. Over time, the problem came to be solved with a 2-Stage approach, where the first stage generates Region Proposals and the second stage classifies each proposed region.
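The 2-Stage idea can be sketched in a few lines of Python. This is only a structural illustration under our own assumptions, not any particular detector: `two_stage_detect`, `toy_propose`, and `toy_classify` are hypothetical names, the "image" is a plain 2D list of intensities, and the toy stages stand in for a real proposal method and a real CNN classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max)
Image = Sequence[Sequence[float]]    # toy image: rows of pixel intensities


@dataclass
class Detection:
    box: Box
    label: str
    score: float


def two_stage_detect(
    image: Image,
    propose_regions: Callable[[Image], List[Box]],
    classify_region: Callable[[Image, Box], Tuple[str, float]],
    score_threshold: float = 0.5,
) -> List[Detection]:
    """Stage 1 proposes candidate boxes; stage 2 classifies each one."""
    detections = []
    for box in propose_regions(image):               # stage 1
        label, score = classify_region(image, box)   # stage 2
        if label != "background" and score >= score_threshold:
            detections.append(Detection(box, label, score))
    return detections


def toy_propose(image: Image) -> List[Box]:
    """Hypothetical stage 1: propose the four quadrants of the image."""
    h, w = len(image), len(image[0])
    return [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
            (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]


def toy_classify(image: Image, box: Box) -> Tuple[str, float]:
    """Hypothetical stage 2: call a region an 'object' if it is mostly bright."""
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    mean = sum(pixels) / len(pixels)
    return ("object", mean) if mean > 0.5 else ("background", 1.0 - mean)


if __name__ == "__main__":
    image = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
    for det in two_stage_detect(image, toy_propose, toy_classify):
        print(det)
```

Keeping the two stages as separate callables mirrors the way the pipeline is described: the proposal stage can be swapped out without touching the classifier, and vice versa.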
In our last post, we had a quick overview of Object Detection and the various approaches and methods used to tackle this problem in Computer Vision. Now, it's time to dive deep into the popular methods of building a State-of-the-Art Object Detector. In particular, we shall focus on one of the earliest families of methods - the Region-Based Convolutional Neural Network (R-CNN) family of Object Detectors. It is called R-CNN because the Object Detector was built by modifying a CNN architecture, either by restructuring it or by adding auxiliary networks or layers, albeit without achieving state-of-the-art performance. R-CNN is the best way to start in the Object Detection space.