When learning-based algorithms were not nearly as good as they are today, this problem was mainly handled by handcrafted techniques, but they had their limits - after all if we don’t see something too well, how could we tell what’s there? And this is where new learning-based methods, especially TecoGAN, come into play. This is a hard enough problem for even a still image, yet this technique is able to do it really well even for videos.
In this short article, we will look at the state of egocentric videoconferencing. Now, this doesn’t mean that only we get to speak during a meeting, it means that we are wearing a camera, which looks like the Input (in below video). The goal is to use a learning algorithm to synthesize this frontal view of us, you can see the recorded reference footage, which is the reality (Ground Truth). This real footage (Ground Truth) would need to be somehow synthesized by the algorithm, the predicted one from this algorithm (Predicted). If we could pull that off, we could add a low-cost egocentric camera to smart glasses and it could pretend to see us from the front, which would be amazing for hands-free videoconferencing.
In today's world, everybody can make Deepfakes by recording a voice sample. So let's understand what this new method can do by example. Let’s watch the below short clip of a speech, and make sure to pay attention to the fact that the louder voice is the English translator. If you pay attention, you can hear the chancellor’s original voice in the background too. So what is the problem here? Honestly, there is no problem here, this is just the way the speech was recorded.