Alongside the architectures covered in our recent blog posts on R-CNN and Fast R-CNN, there is one more famous architecture for Image Classification, Object Detection & Localization. It was the first runner-up in Object Detection, second runner-up in Image Classification, and placed fifth in the Localization task at ILSVRC 2014! This feat makes it one of the major architectures to study on the subject of Object Detection and Image Classification. The architecture is SPPnet – Spatial Pyramid Pooling network. In this article, we shall delve into SPPnet from an Object Detection perspective only.
SPPnet was released shortly after R-CNN; it improved bounding-box prediction speed while achieving a similar mAP to R-CNN. An important feature of SPPnet was that R-CNN's requirement of a fixed input image size was lifted! The input image could be of any size and the network still worked flawlessly, making the architecture agnostic to input image size. To understand why that is significant, let's look at why a fixed input size is compulsory for a conventional Convolutional Neural Network.
Convolution layers can operate on inputs of any size: they output a feature map whose spatial dimensions are proportional to the input's, scaled down by the network's sub-sampling ratio. The fixed-size constraint therefore isn't because of the Conv layers, but due to the Fully Connected layers, which always expect a fixed-length input vector. To solve this problem, the authors replaced the last pooling layer with a Spatial Pyramid Pooling layer. Now, you may think, "How can a pooling layer solve this when it also has fixed window size and stride values?" The answer is a special way of pooling, i.e., Spatial Pyramid Pooling: the window size and stride are chosen in proportion to the feature map's size, so each pooling level always produces the same number of output bins no matter how large the feature map is. Usually, a CNN has a single pooling layer (or none) before the FC layers, but here the authors introduced multiple poolings at different scales, whose outputs are concatenated to form a fixed-length 1-d vector for the FC layer. As shown in the image below, SPPnet had 3 levels of pooling at different scales.
Considering there are 256 feature maps from the last Conv layer (a minimal code sketch follows the list below):
- Each feature map is pooled into 1 value (1×1 grid), forming a 256-d vector
- Each feature map is pooled into 4 values (2×2 grid), forming a 4×256-d vector
- Each feature map is pooled into 16 values (4×4 grid), forming a 16×256-d vector
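Here is a minimal sketch of such a layer in PyTorch (my choice of framework; the paper predates it and ships no reference code), using adaptive max-pooling to realize the proportional window sizes:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Pool a feature map at several grid scales and concatenate the results."""

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # One adaptive max-pool per pyramid level; each outputs a level x level
        # grid, so the result has a fixed length whatever the input's H and W.
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(l) for l in levels)

    def forward(self, x):
        # x: (batch, channels, H, W) with arbitrary H and W
        return torch.cat([p(x).flatten(start_dim=1) for p in self.pools], dim=1)

# With 256 feature maps and levels (1, 2, 4), the output is always
# 256 * (1 + 4 + 16) = 5376-d, whatever the conv output size.
spp = SpatialPyramidPooling()
for size in (13, 10):  # conv5 outputs from two different image sizes
    print(spp(torch.randn(1, 256, size, size)).shape)  # torch.Size([1, 5376])
```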

The SPP layer's output is flattened into a single fixed-length vector and fed to the FC layers. This eliminates the need to crop or warp the input image to a fixed size before feeding it to a CNN. One can apply the SPP layer to any CNN architecture; given the architectures available back in 2014, the authors applied it to AlexNet, Overfeat, and ZF-Net, with minor modifications to padding to obtain the required feature-map sizes.
The authors then took advantage of the flexible input size and trained the network with sizes of 180×180 and 224×224 to enhance its robustness. A 4-level pyramid with scales 6×6, 3×3, 2×2, and 1×1 was used. The error rate decreased with the SPP layer alone, and improved further with multi-size training (training with different input sizes). At test time, each image was cropped at the 4 corners and the center, and each crop was flipped, producing a total of 10 views from a single image; this multi-view approach was used extensively during testing (a sketch follows below).
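As a hedged illustration of the 10-view scheme, here is one way to produce the views with torchvision's TenCrop (my choice of tooling, not the authors'):

```python
import torch
from torchvision import transforms

# TenCrop yields the 4 corner crops + center crop plus their horizontal flips.
ten_crop = transforms.Compose([
    transforms.Resize(256),   # assumed pre-resize so 224x224 crops fit
    transforms.TenCrop(224),  # tuple of 10 PIL images from one input image
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.functional.to_tensor(c) for c in crops])),
])

# At test time, average the model's predictions over the 10 views, e.g.:
#   views  = ten_crop(img)                  # (10, 3, 224, 224)
#   logits = model(views)                   # (10, num_classes)
#   score  = logits.softmax(dim=1).mean(0)  # averaged multi-view prediction
```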
You might be thinking, “All this is fine but how did it improve Object Detection?”
The authors used the SPP mechanism for object detection in an improved way. Rather than sending the ~2000 region proposals one by one through the CNN, they ran the CNN once per image and projected each proposal onto the feature map obtained from the 5th Conv layer. To clear up any confusion about the similarity between the approaches of SPP-net and Fast R-CNN: SPP-net was published in June 2014 and Fast R-CNN in April 2015. We'll see the differences between Fast R-CNN & SPPnet at the end.
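The paper's appendix describes a simple rounding rule for this projection; below is a minimal sketch of that idea, assuming a cumulative stride of 16 (the value for ZF-Net; it depends on the backbone):

```python
def project_roi_to_feature_map(box, stride=16):
    """Map an (x1, y1, x2, y2) proposal from image coordinates to conv5 cells.

    Follows the rounding rule sketched in the SPP-net paper's appendix, with
    `stride` being the network's cumulative sub-sampling ratio.
    """
    x1, y1, x2, y2 = box
    # Left/top corner: floor, then move one cell right/down
    fx1, fy1 = x1 // stride + 1, y1 // stride + 1
    # Right/bottom corner: ceil, then move one cell left/up
    fx2, fy2 = -(-x2 // stride) - 1, -(-y2 // stride) - 1
    return fx1, fy1, fx2, fy2

print(project_roi_to_feature_map((33, 48, 250, 320)))  # (3, 4, 15, 19)
```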
This projection eliminates the 2000 forward passes the CNN would otherwise need per image; suddenly, from 2000, it is just 1. However, Selective Search remains the bottleneck because it still has to generate the 2000 proposals. The projected regions are sent to the SPP layer, which pools each one into a fixed-length vector. This reduced computation time enormously: inference on a test image took well under 1s on a GPU, massively faster than R-CNN while staying on par in accuracy. Putting the pieces together looks roughly like the sketch below.
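Here the `backbone`, `spp`, and `head` modules are assumptions standing in for conv1–conv5 of the network, the SPP layer sketched above, and the classifier; this is an illustrative outline, not the authors' exact pipeline:

```python
import torch

def detect(image, proposals, backbone, spp, head, stride=16):
    feat = backbone(image.unsqueeze(0))        # single CNN pass: (1, C, H', W')
    scores = []
    for box in proposals:                      # ~2000 boxes from Selective Search
        # Uses the projection helper defined above
        fx1, fy1, fx2, fy2 = project_roi_to_feature_map(box, stride)
        roi = feat[:, :, fy1:fy2 + 1, fx1:fx2 + 1]  # crop on the feature map
        scores.append(head(spp(roi)))          # fixed-length vector per region
    return torch.cat(scores)                   # (num_proposals, num_classes)
```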

On the PASCAL VOC 2007 dataset, SPPnet achieved an mAP of ~59%, higher than R-CNN's ~54%. And on the ImageNet detection dataset, SPPnet achieved an mAP of ~35% compared to ~31% for R-CNN. The image below illustrates the difference between the R-CNN and SPPnet pipelines (partial illustration).

With SPP-net, although there isn't a considerable increase in mAP over R-CNN, the speed has certainly increased while maintaining R-CNN's accuracy. Coming to the drawbacks: training was still multi-stage (later solved by Fast R-CNN), and there wasn't a substantial jump in accuracy compared to R-CNN. Hold tight for the next post on Faster R-CNN, where the object detection pipeline is no longer decoupled as it is in R-CNN, SPP-net, and Fast R-CNN: the time-consuming Selective Search is done away with and a Region Proposal Network is introduced.
Until then, I recommend reading the SPPnet paper for more details on the image classification aspects; the ILSVRC 2014 presentation slides for SPP-net (linked in the references) are an extra resource for your learning. Do try your hand at implementing this on a small-scale dataset; you may well encounter some interesting observations.
In our next post, we will continue our exploration of the R-CNN family with the Faster R-CNN architecture. Until then, keep learning, and share your thoughts on this post.
Author
Pranav Raikote
References
- SPPnet Paper: https://arxiv.org/pdf/1406.4729.pdf
- SPPnet Presentation Slides: http://image-net.org/challenges/LSVRC/2014/slides/sppnet_ilsvrc2014.pdf
- AlexNet Paper: https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- Overfeat Paper: https://arxiv.org/abs/1312.6229
- ZF-Net Paper: https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf