Today, a variety of techniques can take an image containing humans and perform pose estimation on it. This gives us skeletons that show the current posture of the subjects in the given images. Having a skeleton opens up the possibility for many cool applications: it's great for fall detection and many other kinds of activity recognition, analyzing athletic performance, and much, much more.
Let’s think bigger: Can we reconstruct not only the pose of the model but the entire 3D geometry of the model itself, including body shape, face, clothes, and more?
Current approaches to image-based 3D human shape estimation have demonstrated their potential in real-world settings, but they still fail to produce reconstructions with the level of detail often present in the input images. The authors argue that this limitation stems primarily from two conflicting requirements: accurate predictions require large spatial context, while precise predictions require high resolution. Due to memory limitations in current hardware, previous approaches tend to take low-resolution images as input to cover large spatial context, and produce less precise (or low-resolution) 3D estimates as a result. The authors of the paper PIFuHD address this limitation by formulating a multi-level architecture that is end-to-end trainable. A coarse level observes the whole image at a lower resolution and focuses on holistic reasoning. This provides context to a fine level, which estimates highly detailed geometry by observing higher-resolution images.
This method builds on the recently introduced Pixel-Aligned Implicit Function (PIFu) framework, which takes images with a resolution of 512×512 as input and obtains low-resolution feature embeddings (128×128). To achieve higher-resolution outputs, the authors stack an additional pixel-aligned prediction module on top of this framework. This fine module takes higher-resolution images (1024×1024) as input and encodes them into high-resolution image features (512×512); it then takes the high-resolution feature embedding, as well as the 3D embeddings from the coarse module, to predict an occupancy probability field. To further improve the quality and fidelity of the reconstruction, the authors first predict normal maps for the front and back sides in image space and feed these to the network as additional input. See Figure 1.
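The core pixel-aligned query can be sketched as follows: for a 3D point, sample the image feature map at the point's 2D projection, concatenate the point's depth, and run a small MLP that outputs an occupancy probability. The snippet below is a minimal, illustrative stand-in (NumPy, random toy weights, and an assumed orthographic projection); the real PIFu/PIFuHD networks use trained convolutional encoders and deeper MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_sample(feat, u, v):
    """Sample a C×H×W feature map at continuous pixel coords (u, v)."""
    C, H, W = feat.shape
    u, v = np.clip(u, 0, W - 1), np.clip(v, 0, H - 1)
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    return (feat[:, v0, u0] * (1 - du) * (1 - dv)
            + feat[:, v0, u1] * du * (1 - dv)
            + feat[:, v1, u0] * (1 - du) * dv
            + feat[:, v1, u1] * du * dv)

def pifu_query(feat, point, weights):
    """Pixel-aligned implicit function: occupancy for one 3D point."""
    x, y, z = point
    # assume orthographic projection: (x, y) index the image plane directly
    phi = bilinear_sample(feat, x, y)
    h = np.concatenate([phi, [z]])      # pixel-aligned feature + depth
    W1, b1, W2, b2 = weights
    h = np.maximum(W1 @ h + b1, 0.0)    # tiny MLP with ReLU
    logit = W2 @ h + b2
    return 1.0 / (1.0 + np.exp(-logit)) # occupancy probability in (0, 1)

# toy setup: random 8-channel 16×16 feature map and random MLP weights
C, H, W = 8, 16, 16
feat = rng.standard_normal((C, H, W))
weights = (rng.standard_normal((32, C + 1)), np.zeros(32),
           rng.standard_normal(32), 0.0)
p = pifu_query(feat, (7.3, 4.8, 0.2), weights)
print(p)
```

Evaluating this function on a dense 3D grid of points yields the occupancy field from which the final surface is extracted; in PIFuHD, the fine module additionally conditions on the coarse module's 3D embedding rather than on the raw depth alone.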
Let’s have a look together and evaluate it with three increasingly difficult experiments.
Let’s see the reconstructed model for still images (Figure 2, front-side reconstructed model). I think if I knew these people, I might have a shot at recognizing them solely from the 3D reconstruction. And not only that, but I also see some detail in the clothes: a suit can be recognized, and the jeans have wrinkles. This new method uses a different geometry representation, one that enables higher-resolution outputs, and it immediately shows. It is clearly working quite well on still images. Now let’s move on to the other experiments.
It can not only deal with still images of the front side, but it can also reconstruct the back of the person. You can see (Figure 2, back reconstructed model) that the back part of the data is completely unobserved. We haven’t seen the back, so how is it even possible to reconstruct it?
An intelligent person would be able to infer some of these details: for instance, we know that this is a suit or that these are boots, and we know roughly what the backside of these objects should look like. This new method leans on an earlier technique by the name of image-to-image translation to estimate this data, and it works very well. If you take a closer look, you see that we have less detail in the back than in the front, which is expected, as that portion is completely unobserved.
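The idea behind image-to-image translation is to learn, from many training pairs, a mapping from observed data to unobserved data, and then apply it to new inputs. Real systems (the pix2pix family, which PIFuHD-style back-side normal prediction builds on) use convolutional generators with adversarial losses; the toy sketch below substitutes a per-pixel linear map fitted by least squares, purely to illustrate the predict-the-unseen-from-the-seen principle.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for image-to-image translation: learn a mapping from
# observed "front" pixels to unobserved "back" pixels using training
# pairs, then apply it to a new image. All data here is synthetic.
n_pairs, n_pix = 200, 64
front = rng.standard_normal((n_pairs, n_pix))          # observed inputs
true_map = rng.standard_normal((n_pix, n_pix)) * 0.1   # hidden relation
back = front @ true_map                                # targets to learn

# fit the "translator" on the training pairs (least squares)
learned, *_ = np.linalg.lstsq(front, back, rcond=None)

# predict the back side of a new, unseen front image
new_front = rng.standard_normal(n_pix)
pred_back = new_front @ learned
print(np.allclose(pred_back, new_front @ true_map, atol=1e-6))  # True
```

In the actual method, the same principle is applied in image space: a network trained on scans that include back-side geometry learns to predict a plausible back normal map from the front view, which is then fed to the reconstruction network.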
Now let’s see whether it can handle video reconstruction.
It does the job, but obviously, there is still quite a bit of flickering. The key idea here is that the new method performs these reconstructions in a way that is consistent; in other words, if there is a small change in the input, there will also be only a small change in the output model.
This is the property that opens up the possibility to extend this method to videos.