Welcome to the Applied Singularity blog. Use this Contents post to browse through the full list of articles and Guided Learning Modules we have created or find specific topics of interest.
In our previous article, we had a detailed look at the YOLO architecture, why it is famous and performs so well - up to 45fps with an accuracy of more than 65%! Like any other architecture, it also had some flaws which needed to be solved to break the 45fps barrier and to better the mAP of 65%. Coming to the drawbacks of YOLO v1, it used to get the localization wrong when the objects appeared in a different aspect ratio and failed to detect multiple small objects like a flock of birds. Let’s see how the YOLOv1 was improved to a better, faster, and stronger YOLO with an mAP of more than 78%, which is a huge improvement over YOLO v1 in terms of accuracy but with a slower speed of 40fps. But, at 67fps YOLO v2 performs at 76% mAP trained on the PASCAL VOC 2007. And we will also understand why we had the title as YOLO9000!
Hello, I’m back with an Advanced Object Detector article! We are now past the amateur stage if and only if you have read through all our previous articles on Object Detection, R-CNN, Fast R-CNN, SPPnet, Faster R-CNN & Mask R-CNN. Although Mask R-CNN is also a tad advanced, we considered learning related to the R-CNN Family of Object Detectors. We are on the right track to mastering Object Detection after learning the building blocks of an Object Detector and have a fair intuition about the mechanism of how various parts are put together for a fully functional Object Detector. We progressed through the R-CNN, Fast R-CNN & Faster R-CNN architectures and understood how evolution happens to improve the accuracy and reduce the inference time taken per image.
When learning-based algorithms were not nearly as good as they are today, this problem was mainly handled by handcrafted techniques, but they had their limits - after all if we don’t see something too well, how could we tell what’s there? And this is where new learning-based methods, especially TecoGAN, come into play. This is a hard enough problem for even a still image, yet this technique is able to do it really well even for videos.
In this short article, we will look at the state of egocentric videoconferencing. Now, this doesn’t mean that only we get to speak during a meeting, it means that we are wearing a camera, which looks like the Input (in below video). The goal is to use a learning algorithm to synthesize this frontal view of us, you can see the recorded reference footage, which is the reality (Ground Truth). This real footage (Ground Truth) would need to be somehow synthesized by the algorithm, the predicted one from this algorithm (Predicted). If we could pull that off, we could add a low-cost egocentric camera to smart glasses and it could pretend to see us from the front, which would be amazing for hands-free videoconferencing.