Hi everyone, welcome back to another post in our Object Detection Series! If you have not read our previous posts, I suggest having a look at them first, as this post builds on them.

In our last blog post, we went through the Faster R-CNN architecture for Object Detection, which remains one of the State-of-the-Art architectures to date! Faster R-CNN has a very low inference time of just ~0.2s per image (5 fps), a huge improvement over the ~45-50s per image of the original R-CNN. So far, we have followed the evolution of R-CNN into Fast R-CNN and Faster R-CNN in terms of simplifying the architecture, reducing training and inference times and increasing the mAP (Mean Average Precision).

This article takes a step further, from Object Detection to Instance Segmentation. Instance Segmentation is the identification of the boundaries of the detected objects at the pixel level. It goes one step beyond Semantic Segmentation, which groups all pixels of the same class under one common mask without telling the individual objects apart. Instance Segmentation, in contrast, labels each object of the same class as a separate instance. To understand this clearly, have a look at the images below:

Image Credits – Analytics Vidhya

Image Credits – TowardsDataScience

Mask R-CNN is an extension of Faster R-CNN and belongs to the R-CNN family of architectures, which we have discussed in great detail in our previous posts. In the paper, Mask R-CNN outperformed all existing single-model entries on every task of the COCO suite of challenges, including the COCO 2016 Challenge winners! Let's understand why we say Mask R-CNN extends Faster R-CNN: Mask R-CNN has an extra branch that outputs a segmentation mask for each Region of Interest (RoI) in a per-pixel way. Therefore, it has three outputs for each detected object: Class Label, Bounding-Box Offset, and Object Mask. To understand Mask R-CNN, we need a solid understanding of Faster R-CNN, which was explained in our previous blog post. Read up on it before continuing further.

Now, Mask R-CNN consists of the original Faster R-CNN network for the Object Detection & Localization tasks. It retains the class label branch with softmax activations and the bounding box branch from Faster R-CNN, with slight improvements. The authors tried a better CNN architecture as their Backbone network in the form of ResNet-FPN (Feature Pyramid Network), which incorporates a robust multi-level RoI feature extraction mechanism. More on Feature Pyramid Networks can be found in the FPN paper listed in the references. The mask branch needs a very accurate per-pixel preservation of spatial features, and hence a new RoI Align layer was introduced to address the misalignment between the RoI and the extracted features caused by the quantizations in the RoI Pool layer. The RoI Align layer skips the quantizations: it keeps the RoI coordinates as floating-point values, divides the RoI into equal bins (for example a 3*3 grid of 9 bins) and computes the feature values at sampling points inside each bin with Bilinear interpolation.
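To make the difference concrete, here is a minimal NumPy sketch that compares a quantized lookup (RoI Pool style) with a bilinear lookup (RoI Align style) for a single point. It is only an illustration under stated assumptions, not the paper's implementation: the function name `bilinear_sample`, the toy 4*4 feature map, the stride of 16 and the convention that integer coordinates are cell centers are all choices made for this example.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Estimate the value at a fractional (y, x) location from its 4 nearest cells.

    Assumption for this sketch: integer coordinates are the cell centers.
    """
    h, w = feature_map.shape
    # Clamp so that the 2x2 neighbourhood stays inside the feature map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two rows, then along y between the rows.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return float((1 - dy) * top + dy * bottom)

STRIDE = 16  # assumed downsampling factor between image and feature map

feature_map = np.arange(16, dtype=float).reshape(4, 4)  # toy feature map
x_img, y_img = 40.0, 24.0  # a point given in image coordinates

# RoI Pool style: the coordinate is rounded down, i.e. [x/16].
xq, yq = int(x_img // STRIDE), int(y_img // STRIDE)
quantized = float(feature_map[yq, xq])

# RoI Align style: the fractional coordinate x/16 is kept and interpolated.
xa, ya = x_img / STRIDE, y_img / STRIDE
aligned = bilinear_sample(feature_map, ya, xa)

print(quantized, aligned)
```

In this toy case the quantized lookup reads the cell at (1, 2), while the interpolated lookup blends the four cells around (1.5, 2.5). The full layer repeats this sampling at several points per bin and aggregates them.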

Bilinear interpolation is a technique to estimate the value at a grid location based on the 4 nearest cell centers. It is often used when resampling or projecting data from one cell size to another. RoI Align fixes the harsh quantizations by using x/16 on the feature map instead of the rounded [x/16], where 16 is the stride of the feature map relative to the input image. The RoI Align layer takes multiple RoIs of different sizes and warps the features of each one into a fixed dimension. These warped features are then fed to the Fully Connected layers for the softmax classification and the bounding box prediction (which is refined using the bounding box regressor).

The same warped features are also fed to the mask branch, a small fully convolutional network that outputs the binary masks. In this branch the features are upsampled and a 1*1 convolution reduces the 256 channels to K, giving a K*(m*m) mask representation, where K is the number of classes and m = 28 with the ResNet-FPN network as backbone. During training, the ground-truth masks are scaled down to 28*28 for the computation of the loss, and during inference the predicted masks are upscaled to the size of the RoI bounding box. Because one binary mask is predicted per class, there is no competition among the various classes: the mask branch only has to separate object from background based on the warped features, and the classification branch decides which of the K masks is used. The image below depicts the Mask R-CNN architecture at an abstract level.

Image Credits – Mask R-CNN paper

Like Faster R-CNN, Mask R-CNN uses anchor boxes to detect multiple objects of various scales, including objects that overlap in the image. The filtering of boxes occurs at an IoU value of 0.5: an RoI counts as positive if it has an IoU of at least 0.5 with a ground-truth box, and as negative otherwise. Non-max suppression is then used to remove duplicate detections of the same object. Here, the bounding box with the highest confidence score is picked, and the other bounding boxes whose IoU with it exceeds the threshold (0.5 here) are suppressed.
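The IoU computation and the suppression step can be sketched in a few lines of NumPy. This is a simplified, single-class illustration and not the code used by the authors: the helper names `iou` and `nms`, the `[x1, y1, x2, y2]` box format and the three toy boxes are assumptions made for this example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Width and height of the intersection are clipped at 0 for disjoint boxes.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        # Only boxes that do NOT overlap the kept box heavily survive.
        order = rest[overlaps <= iou_threshold]
    return keep

# Toy example: boxes 0 and 1 cover the same object, box 2 is a different object.
boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])

kept = nms(boxes, scores)
print(kept)
```

Boxes 0 and 1 overlap with an IoU of roughly 0.82, so the lower-scoring box 1 is suppressed, while the distant box 2 is kept.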

Coming to the loss functions and the training procedure, Mask R-CNN combines the losses of classification, localization and segmentation mask: L = L_cls + L_box + L_mask. The mask branch has a K*m^2 dimensional output for each RoI. A per-pixel sigmoid is applied, and L_mask is the average binary cross-entropy loss. For an RoI with ground-truth class k, L_mask is only defined on the k-th mask (the other mask outputs don't contribute to the loss). In the formula given below, y_ij is the label of the cell (i, j) in the true mask for the region of size m*m, and ŷ^k_ij is the predicted value of the same cell in the mask learned for the class k.

Loss function for the Mask branch Image Credits – lilianweng.github.io
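Written out in plain notation, the average binary cross-entropy described above is L_mask = -(1/m^2) * sum over all cells (i, j) of [ y_ij * log(ŷ^k_ij) + (1 - y_ij) * log(1 - ŷ^k_ij) ]. The NumPy sketch below computes it for a single RoI. The function name `mask_loss`, the random logits and the square toy ground-truth mask are assumptions made for illustration; a real implementation would work on batches inside a deep learning framework.

```python
import numpy as np

def mask_loss(mask_logits, true_mask, k):
    """Average binary cross-entropy on the k-th of the K predicted m x m masks.

    mask_logits: array of shape (K, m, m), raw outputs of the mask branch
    true_mask:   array of shape (m, m) with values 0 or 1
    k:           ground-truth class of the RoI
    """
    logits_k = mask_logits[k]                 # only the k-th mask contributes
    probs = 1.0 / (1.0 + np.exp(-logits_k))   # per-pixel sigmoid
    probs = np.clip(probs, 1e-12, 1 - 1e-12)  # avoid log(0)
    bce = -(true_mask * np.log(probs) + (1 - true_mask) * np.log(1 - probs))
    return float(bce.mean())                  # mean over the m*m cells

K, m = 3, 28
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(K, m, m))      # stand-in for network outputs
true_mask = np.zeros((m, m))
true_mask[8:20, 8:20] = 1.0                   # toy square object

loss = mask_loss(mask_logits, true_mask, k=1)

# Changing the masks of the other classes leaves the loss untouched.
other = mask_logits.copy()
other[0] += 5.0
other[2] -= 5.0
same_loss = mask_loss(other, true_mask, k=1)
print(loss, same_loss)
```

The second call shows the decoupling in practice: the outputs for the other classes can change arbitrarily without affecting the loss of this RoI.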

The images are resized such that the shorter side is 800 pixels, and the network was trained for 160k iterations with a learning rate of 0.02, a weight decay of 0.0001 and a momentum of 0.9. During inference, there are 1000 proposals for FPN, on which the class and bounding box prediction branches are run. The highest scoring 100 boxes are then sent to the mask branch, which speeds up inference and improves accuracy. Mask R-CNN with its mask branch removed was compared with State-of-the-Art models, and Mask R-CNN using ResNet-101-FPN outperformed all other baseline models for object detection. Mask R-CNN is also fairly fast to train: around 32 hours with ResNet-50-FPN on an 8-GPU setup and up to 44 hours with ResNet-101-FPN, on the COCO trainval35k split (about 115k images). The inference speed of Mask R-CNN is about 5 fps, which is respectable keeping in mind the additional segmentation branch. Given below is an example output for an inference image.

Mask R-CNN output with various colored masks and bounding boxes. Image Credits – TowardsDataScience
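The inference-time steps described above (keep the 100 highest scoring boxes, take the mask of the predicted class, resize it to the box and binarize it) can be sketched as follows. This is a rough illustration under several assumptions: the helper names `top_k_indices`, `resize_nearest` and `paste_mask` are made up for this example, the mask probabilities are hand-made instead of coming from a network, and nearest-neighbour resizing is used as a simple stand-in for the smoother resizing a real implementation would use. The 0.5 binarization threshold follows the paper.

```python
import numpy as np

def top_k_indices(scores, k=100):
    """Indices of the k highest scoring detections, best first."""
    return np.argsort(scores)[::-1][:k]

def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2-D array (stand-in for bilinear resizing)."""
    in_h, in_w = mask.shape
    rows = np.minimum((np.arange(out_h) + 0.5) * in_h / out_h, in_h - 1).astype(int)
    cols = np.minimum((np.arange(out_w) + 0.5) * in_w / out_w, in_w - 1).astype(int)
    return mask[rows[:, None], cols[None, :]]

def paste_mask(mask_probs, label, box, image_shape, threshold=0.5):
    """Turn the (K, m, m) mask output of one box into a full-image binary mask."""
    x1, y1, x2, y2 = box                   # integer pixel coordinates
    class_mask = mask_probs[label]         # the m x m mask of the predicted class
    resized = resize_nearest(class_mask, y2 - y1, x2 - x1)
    full = np.zeros(image_shape, dtype=bool)
    full[y1:y2, x1:x2] = resized >= threshold  # binarize at 0.5
    return full

rng = np.random.default_rng(0)
scores = rng.random(150)                   # 150 toy detections after NMS
keep = top_k_indices(scores, k=100)        # only these go to the mask branch

# Hand-made mask output for one kept box: class 1 is confident, class 0 is not.
K, m = 2, 28
mask_probs = np.empty((K, m, m))
mask_probs[0] = 0.1
mask_probs[1] = 0.9

box = (10, 20, 60, 90)                     # x1, y1, x2, y2
full = paste_mask(mask_probs, label=1, box=box, image_shape=(120, 100))
print(len(keep), int(full.sum()))
```

Running the mask branch only on the kept boxes is what saves computation: the other detections never get a mask at all.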

I recommend reading the full paper to understand the various experiments (on Cityscapes and COCO) and the modifications made to explore and improve the performance (also called Ablation Studies). Owing to its segmentation capabilities, this architecture was also extended to Human Pose Estimation, which is included in the Mask R-CNN paper.

So, here we are at the end of the R-CNN Family of Architectures. We started with R-CNN and made it all the way to Mask R-CNN in this post. In our next post, we will enter the YOLO (You Only Look Once) and SSD families of Single-stage Object Detectors, which are the current State-of-the-Art in speed at 45-120 fps (across the various YOLO versions from v1 to v5 and Fast YOLO). That is phenomenal for real-world deployments, and it comes with very good accuracy. Until then, keep learning, try to implement or fine-tune a Mask R-CNN for your own tasks, and put down your thoughts on this article.

The image below illustrates the summary and evolution of the models.


Pranav Raikote


  1. Mask R-CNN paper: https://arxiv.org/pdf/1703.06870.pdf
  2. ResNet 101 paper: https://arxiv.org/pdf/1512.03385.pdf
  3. Feature Pyramid Network: https://arxiv.org/pdf/1612.03144.pdf
  4. From Slide 89 onwards: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
  5. Seminar Slides: https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
  6. Mask R-CNN at ICCV17: https://www.youtube.com/watch?v=g7z4mkfRjI4
