- Baby Spinach-based Minimal Modified Sensor (BSMS) for nucleic acid analysis
- Object Detection – Part 6: Mask R-CNN
- Object Detection – Part 5: Faster R-CNN
- Object Detection – Part 4: Spatial Pyramid Pooling (SPPnet)
- Object Detection – Part 3: Fast R-CNN
- Object Detection – Part 2: Region Based CNN (R-CNN)
- Object Detection – Part 1: Introduction

In diseases such as Alzheimer’s, misfolded precursor proteins aggregate to form soluble oligomers and plaques, and the current procedures for detecting these plaques are quite expensive. BSMS could potentially serve as a platform for detecting these proteins, as well as those involved in many other diseases.

The importance of BSMS lies in the philosophy of its simple design and versatile applications. MicroRNAs (miRNAs) are small, endogenous, non-coding RNAs that play an important role in regulating gene expression. In diseases such as cancer, the role of miRNAs has been studied extensively. Dysregulation of miRNA biogenesis, whether upregulation or downregulation, has been a major factor in proliferative signaling, evasion of growth suppressors, and activation of invasion and metastasis in cancers.

miRNAs have been identified as primary biomarkers in human cancer prognosis. Hence, successful identification and detection of miRNAs would further help in the development of diagnostic and therapeutic methods.

There are many methods currently available for miRNA detection, including microarrays, RNA-seq, RT-qPCR, Northern blotting, and *in-situ* hybridization. Each of these methods has its advantages as well as disadvantages: many require large quantities of miRNA, while others are complicated to analyze. To overcome these limitations, BSMS has been adapted for miRNA detection, and researchers are continuing to redesign it to increase its sensitivity.

BSMS can be further developed into a universal sensor by incorporating a wide range of sensing modules for different targets of various sizes. Developing a high-throughput version of the sensor has been a key objective for further progress of BSMS.

Asha Guraka

In our last blog post, we went through the Faster R-CNN architecture for Object Detection, which remains one of the state-of-the-art architectures to date! Faster R-CNN has a very low inference time of just ~0.2 s per image (5 fps), a huge improvement over the ~45–50 s per image of the original R-CNN. So far, we have followed the evolution of R-CNN into Fast R-CNN and Faster R-CNN in terms of simplifying the architecture, reducing training and inference times, and increasing the mAP (Mean Average Precision). This article takes a step further, from Object Detection to Instance Segmentation. Instance Segmentation is the identification of the boundaries of detected objects at the pixel level. It goes a step beyond Semantic Segmentation, which groups similar entities and gives them a common mask to differentiate them from other objects; instance segmentation labels each object of the same class as a distinct instance. To understand this clearly, have a look at the image below:

*Image Credits – Analytics Vidhya*

*Image Credits – TowardsDataScience*

Mask R-CNN is an extension of Faster R-CNN and belongs to the R-CNN family of architectures, which we have discussed in great detail in our previous posts. Mask R-CNN outperformed all models in all tasks and was even the COCO 2016 Challenge winner! Let’s understand why we say Mask R-CNN extends Faster R-CNN: Mask R-CNN has an extra branch that outputs a segmentation mask for each Region of Interest (RoI) in a per-pixel way. Therefore, it has three outputs for each detected object: class label, bounding-box offset, and object mask. To understand Mask R-CNN, we need a solid understanding of Faster R-CNN, which was explained in our previous blog here. Read up on it before continuing further.

Now, Mask R-CNN builds on the original Faster R-CNN network for the object detection and localization tasks. It retains the class-label branch with softmax activations and the bounding-box branch from Faster R-CNN, with slight improvements. The authors tried a better CNN architecture as their backbone network in the form of ResNet-FPN (Feature Pyramid Network), which incorporates a robust multi-layer RoI extraction mechanism. More on Feature Pyramid Networks here. The mask branch needs a very accurate per-pixel preservation of spatial features, and hence a new RoI Align layer was introduced to address the misalignment between the RoI and the extracted features caused by the quantizations in the RoI Pool layer. The RoI Align layer skips the quantizations by dividing the RoI into equal-sized bins and computing the feature value at regularly sampled points inside each bin using bilinear interpolation.

Bilinear interpolation is a technique to estimate the value at a grid location from the 4 nearest cell centers, often used when resampling or projecting data from one cell size to another. RoI Align fixes the harsh quantizations by using x/16 binning on the feature map instead of [x/16]. The RoI Align layer takes each proposed region and warps its features into a fixed dimension. These warped features are then fed to Fully Connected layers for softmax classification and bounding-box prediction (refined by the bounding-box regressor). The same warped features are also fed to the mask branch, a small fully convolutional head that outputs the binary masks. This branch produces a K×m×m mask representation, where K is the number of classes and m = 28 with the ResNet-FPN backbone: the features are upscaled by a deconvolution, and a final 1×1 convolution produces the K mask channels. During training, the ground-truth masks are scaled down to 28×28 to compute the loss; during inference, the predicted masks are upscaled to the size of the RoI bounding box. Because a separate binary mask is predicted per class, there is no competition among the various classes: the mask is based purely on the warped features and has nothing to do with the predicted class of the object. The below image depicts the Mask R-CNN architecture at an abstract level.
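To make the interpolation concrete, here is a minimal NumPy sketch (illustrative only, not the paper’s implementation) of sampling a feature map at a continuous location from its 4 nearest cell centers:

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a (H, W) feature map at continuous coords (y, x)
    as a weighted average of the 4 nearest integer grid points."""
    h, w = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # The four weights sum to 1, so values on the grid are reproduced exactly
    return (feature_map[y0, x0] * (1 - dy) * (1 - dx)
            + feature_map[y0, x1] * (1 - dy) * dx
            + feature_map[y1, x0] * dy * (1 - dx)
            + feature_map[y1, x1] * dy * dx)
```

Sampling at the exact midpoint of four cells returns their mean, and sampling at a grid point returns that cell’s value unchanged.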

*Image Credits – Mask R-CNN paper*

Like Faster R-CNN, Mask R-CNN uses anchor boxes to detect multiple objects of various scales, including overlapping objects in the image. Anchor boxes are filtered at an IoU threshold of 0.5, and non-max suppression is applied: the bounding box with the highest confidence is kept, and any other box whose IoU with it exceeds 0.5 is suppressed, so that each object is identified by a single box.
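The greedy suppression described above can be sketched as follows (a simplified NumPy version; the (x1, y1, x2, y2) box format and the 0.5 threshold are illustrative choices):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: keep the highest-scoring box,
    drop boxes overlapping it above the threshold, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep
```

Two heavily overlapping boxes for the same object collapse to the single higher-scoring one, while a distant box survives.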

Coming to the loss functions and training procedure, Mask R-CNN combines the losses of classification, localization, and segmentation mask: L = L_{cls} + L_{box} + L_{mask}. The mask branch has a K*m^2-dimensional output for each RoI. A per-pixel sigmoid is applied, and L_{mask} is the average binary cross-entropy loss. For an RoI with ground-truth class k, L_{mask} is defined only on the k-th mask (the other mask outputs do not contribute to the loss). In the formula given below, y_{ij} is the label of cell (i, j) in the true mask for the m*m region, and y^{k}_{ij} is the predicted value of the same cell in the mask learned for class k.
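As a sanity check of this definition, here is a hedged NumPy sketch of L_{mask} for a single RoI: a per-pixel sigmoid over the logits, with binary cross-entropy averaged over the k-th mask only (function name and shapes are my own, for illustration):

```python
import numpy as np

def mask_loss(logits, gt_mask, k):
    """L_mask for one RoI: logits has shape (K, m, m); only the mask
    for the ground-truth class k contributes to the loss."""
    p = 1.0 / (1.0 + np.exp(-logits[k]))  # per-pixel sigmoid
    eps = 1e-7                            # guard against log(0)
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                     # average over the m*m pixels
```

With all-zero logits (sigmoid = 0.5 everywhere), the loss is log 2 regardless of the ground truth, the usual starting point of an untrained binary classifier.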

*Loss function for the Mask branch *– *Image Credits – lilianweng.github.io*

The images are resized such that the shorter side is 800 pixels, and the network was trained for 160k iterations with a learning rate of 0.02, a weight decay of 0.0001, and momentum of 0.9. During inference, there are 1000 proposals from the FPN on which the class and bounding-box prediction branches are run. The 100 highest-scoring boxes are sent to the mask branch, which speeds up inference and improves accuracy. Mask R-CNN with its mask branch removed was compared with state-of-the-art models: using ResNet-101 FPN, it outperformed all other baseline models for object detection. Mask R-CNN is also quite fast to train, taking around 32 hours on an 8-GPU setup and up to 44 hours with the ResNet-101 FPN backbone on 135k images. Its inference speed is about ~5 fps, which is respectable keeping in mind the additional segmentation branch. Given below is an example output for an inference image.

*Mask R-CNN output with various colored masks and bounding boxes. Image Credits – TowardsDataScience*

I recommend reading the full paper to understand the various experiments (on Cityscapes and COCO) and the modifications made to explore and improve performance (the ablation studies). Thanks to its segmentation capabilities, this architecture was also extended to Human Pose Estimation in the Mask R-CNN paper.

So, here we are at the end of the R-CNN family of architectures. We started with R-CNN and came all the way to Mask R-CNN in this post. In our next post, we will enter the YOLO (You Only Look Once) and SSD family of single-stage object detectors, which are the current state of the art at 45–120 fps (across the various YOLO versions from v1 to v5 and Fast-YOLO), phenomenal for real-world deployments with very good accuracy. Until then, keep learning, try to implement or fine-tune a Mask R-CNN for your own tasks, and put down your thoughts on this article.

Below image illustrates the Summary and Evolution of Models.

Pranav Raikote

- Mask R-CNN paper: https://arxiv.org/pdf/1703.06870.pdf
- ResNet 101 paper: https://arxiv.org/pdf/1512.03385.pdf
- Feature Pyramid Network: https://arxiv.org/pdf/1612.03144.pdf
- From Slide 89 onwards: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
- Seminar Slides: https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
- Mask R-CNN at ICCV17: https://www.youtube.com/watch?v=g7z4mkfRjI4

The major bottleneck we’ve seen in R-CNN, Fast R-CNN, and SPP-net is the Selective Search algorithm. It takes around 2 s per image to generate proposals, and it runs on the CPU. Even with a GPU available, time is lost transferring the data to the GPU for further processing through the CNN. The Faster R-CNN paper features a Region Proposal Network that brings that 2 s down to about 10 ms while being as accurate as (and sometimes more accurate than) Selective Search. Let’s see how Selective Search was replaced in Faster R-CNN.

The Faster R-CNN has a unified model with two sub-networks – Region Proposal Network (RPN), which is a Convolutional Neural Network for proposing the regions, and the second network is a Fast R-CNN for feature extraction and outputting the Bounding Box and Class Labels. Here, the RPN serves as an Attention Mechanism in the Faster R-CNN pipeline. Let’s understand the importance of RPN and how it is replacing the Selective Search Algorithm. Given below is the pictorial representation of RPN in Faster R-CNN.

*Region Proposal Network within the Faster R-CNN architecture. Image Credits – Faster R-CNN paper*

In the Faster R-CNN network, there is one backbone CNN, and the output features are used by both the RPN and the object detector network, which is a Fast R-CNN. The Region Proposal Network uses a sliding-window approach, i.e. it slides a window of a specific size over the feature map and generates k anchor boxes of different shapes and sizes at each position. By default, k is 9: 3 scales of 128×128, 256×256, and 512×512 combined with 3 aspect ratios of 1:1, 1:2, and 2:1. The images below illustrate the anchor boxes and the RPN’s sliding window in action.

*Anchor Boxes’ Configurations within the Faster R-CNN’s RPN network. Image Credits – TowardsDataScience*

*Sliding Window generating the k Anchor Boxes – RPN. Image Credits – GeeksForGeeks*

The task of the RPN is to predict the probability of an anchor being background or foreground (containing an object). During training, the input image is accompanied by the ground-truth boxes, and the network learns to improve its region proposals. For a feature map of 40×60, 9 anchor boxes per location result in over 20k proposals, which remains a large number. The authors included a softmax layer from which we get confidence scores, rank them, and take just the top-n anchor proposals. An anchor is considered positive (presence of an object) based on either of two conditions: the anchor has the highest IoU (Intersection over Union, a measure of overlap) with a ground-truth box, or the anchor has an IoU of at least 0.7 with any ground-truth box. On the other hand, an anchor is negative if its IoU with every ground-truth box is at most 0.3. The remaining anchors are discarded for training. If we sampled all anchors, there would be a bias towards negative samples; to avoid this, 128 positive and 128 negative samples are selected at random per image.
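A minimal sketch of the anchor generation and the IoU-based labeling rule described above, assuming the default 3 scales × 3 aspect ratios (function names are mine, not from any library):

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centred at (cx, cy),
    as (x1, y1, x2, y2). Each anchor keeps the area scale*scale while its
    height/width ratio equals the aspect ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)
            h = s * np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

def label_anchor(max_iou):
    """Label by the paper's thresholds: positive >= 0.7, negative <= 0.3,
    everything in between is ignored during training."""
    if max_iou >= 0.7:
        return 1
    if max_iou <= 0.3:
        return 0
    return -1  # don't care
```

Note that every anchor of scale s has area s² exactly; only its shape changes with the aspect ratio.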

In addition to the binary softmax classifier, there is a linear regression layer that outputs the x, y, w, h coordinates of the box (x, y is the center of the anchor, w the width, and h the height). This regression is applied only to anchors predicted as positive. All cross-boundary anchors are discarded, as they don’t contribute much to the optimization. We will revisit the loss functions and training procedure later in this article. Now, with the RPN explained, the detailed Faster R-CNN pipeline looks as shown below:
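The regression layer does not predict raw coordinates but offsets in the parameterization used throughout the R-CNN family: centre offsets scaled by the anchor size, and log-space width/height scaling. A small sketch (boxes in (cx, cy, w, h) form; helper names are mine):

```python
import numpy as np

def bbox_to_deltas(anchor, gt):
    """Regression targets (t_x, t_y, t_w, t_h): centre offsets scaled
    by the anchor size, log-space scaling of width and height."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def deltas_to_bbox(anchor, t):
    """Invert the parameterization to recover a predicted box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = t
    return np.array([xa + tx * wa, ya + ty * ha,
                     wa * np.exp(tw), ha * np.exp(th)])
```

The two functions are exact inverses, so the network can learn small, scale-invariant offsets instead of absolute pixel coordinates.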

*The Detailed Faster R-CNN Architecture. Image Credits – TowardsDataScience*

The variable-sized regions proposed by the RPN are fed to the RoI Pooling layer. Refer to the blog post on Fast R-CNN for a better understanding of RoI Pooling. Here, the variable-dimensional representations are pooled into fixed-length vectors, which are in turn given to the softmax classification layer and the bounding-box regressor layer. Apart from the RPN, the remainder of the architecture is a Fast R-CNN acting as the detector network. With the architecture understood, let’s now see how we train this network with its multiple loss functions.

The RPN is optimized with the multi-task loss function given below, consisting of a classification loss combined with a regression loss. In the loss function, p_{i} is the predicted probability of anchor i being an object, and p_{i}* is the ground-truth label of whether anchor i is an object. L_{cls} is a log loss over two classes: the sample is the target object versus not.

The regression loss uses a smooth L1 function. Here, t_{i} and t_{i}* are the four predicted coordinates and the ground-truth coordinates, respectively.

*The Loss functions of RPN*

N_{cls} is a normalization term set to the mini-batch size (256), and N_{box} is a normalization term set to the number of anchor locations (~2,400). λ is a balancing parameter set to 10, so that L_{cls} and L_{box} are weighted roughly equally. The RPN is trained via end-to-end backpropagation and standard Stochastic Gradient Descent with a learning rate of 0.001. But we also want the RPN and the detector network to share convolutional features, which decreases inference time. The authors came up with a 4-step training procedure that learns the shared features via alternating optimization.
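Putting the two terms together, here is a simplified NumPy version of the RPN loss (the normalizers and λ = 10 follow the paper; the tensor shapes and function names are my assumptions for illustration):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear beyond, so it is less
    sensitive to outliers than a plain L2 loss."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_box=2400, lam=10.0):
    """Sketch of the RPN multi-task loss: log loss over anchors plus a
    smooth-L1 box loss counted only for positive anchors (p_star == 1).
    p, p_star: shape (A,); t, t_star: shape (A, 4)."""
    eps = 1e-7  # guard against log(0)
    cls = -(p_star * np.log(p + eps)
            + (1 - p_star) * np.log(1 - p + eps)).sum() / n_cls
    box = (p_star[:, None] * smooth_l1(t - t_star)).sum() / n_box
    return cls + lam * box
```

The `p_star` multiplier in the box term is what switches the regression loss off for negative anchors.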

The RPN is trained first independently, with pretrained ImageNet weights and fine-tuned end-to-end for the Region Proposal task. In the next step, the Detector network of Fast R-CNN is trained and fine-tuned end-to-end using proposals generated by the trained RPN. The Fast R-CNN’s layers are also initialized with ImageNet pre-trained weights. Now, we use this trained Fast R-CNN detector network to initialize RPN training and fine-tune only the RPN specific layers, the other layer weights are frozen. From here on, the convolutional layers are shared between both the networks. In the final step, we again fine-tune the specific or unique layers of the Fast R-CNN. Now, we’ve got a unified model with both networks sharing the Convolution layers. “This procedure can be repeated, but there was no significant improvement”, said the authors.

Faster R-CNN achieves an mAP of 66.7% on the PASCAL VOC 2007 dataset, and up to ~79% when trained on a combination of PASCAL VOC 2007, VOC 2012, and COCO. The inference time drops to ~0.2 s per image with the RPN, compared to ~2.5 s for the Selective-Search-based Fast R-CNN. Thus, the RPN contributed modestly to the mAP while greatly speeding up the pipeline. Many experiments were conducted to tune the number of proposals and the dataset combinations used for training; I suggest reading the full Faster R-CNN paper for all the intricate details and interesting observations. In December 2015, a Faster R-CNN with a ResNet-101 backbone won the COCO Object Detection Competition, and it is considered one of the state-of-the-art networks for object detection to date! Faster R-CNN was extended to pixel-level image segmentation in 2017 as the popular Mask R-CNN, which is utilized in many real-world applications. We will discuss Mask R-CNN as we continue with the R-CNN family of object detectors in our next blog post. Until then, try implementing Faster R-CNN and put down your thoughts and observations in the comments below.

Pranav Raikote

SPPnet was released shortly after R-CNN; it improved the bounding-box prediction speed and had a similar mAP to R-CNN. An important feature of SPPnet was that R-CNN’s requirement of a fixed input image size was lifted! The image size could be anything and the network still worked flawlessly, making the architecture agnostic to the input image size. To understand why that is significant, let’s understand why a fixed input size is compulsory for a Convolutional Neural Network.

Convolution layers compute feature maps whose size is proportional to the input size by a specific ratio called the sub-sampling ratio. The fixed-size constraint therefore isn’t due to the Conv layers, but to the Fully Connected layers, which always require a fixed-length input vector. To solve this problem, the authors replaced the last pooling layer with a Spatial Pyramid Pooling layer. Now, you may think, “How can a pooling layer solve this, when it also has fixed window size and stride values?” The answer is a special way of pooling, i.e. Spatial Pyramid Pooling. Usually, a CNN has a single pooling layer (or none) before the FC layers, but here the authors introduced multiple poolings at different scales whose outputs are concatenated to form a fixed-length vector for the FC layer. As shown in the image below, SPPnet uses 3 levels of pooling at different scales.

Considering there are 256 feature maps from the last Conv layer,

- Each feature map is pooled into 1 value forming a 256-d vector
- Each feature map is pooled into 4 values forming a 4×256-d vector
- Each feature map is pooled into 16 values forming a 16×256-d vector

The SPP layer output is flattened into a one-dimensional vector and sent to the FC layer. This eliminates cropping the input image to a fixed size before feeding it to a CNN. The SPP layer can be applied to any CNN architecture; given the architectures available in 2014, the authors applied it to AlexNet, Overfeat, and ZF-Net, with minor modifications to padding to get the required feature-map output.
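The three-level pooling in the list above can be sketched in NumPy as follows; computing the bin boundaries with `np.linspace` is my simplification of the paper’s floor/ceil bin sizing:

```python
import numpy as np

def spp(feature_maps, levels=(1, 2, 4)):
    """Spatial Pyramid Pooling: max-pool each of the C feature maps into
    an n x n grid for every pyramid level, then concatenate everything
    into one fixed-length vector regardless of the input H x W."""
    c, h, w = feature_maps.shape
    out = []
    for n in levels:
        # Bin edges chosen so the n x n grid covers the whole map
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                region = feature_maps[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                out.append(region.max(axis=(1, 2)))
    return np.concatenate(out)  # length = C * sum(n*n for n in levels)
```

For C = 256 maps and levels (1, 2, 4), the output length is 256·(1 + 4 + 16) = 21·256, exactly the 256-d, 4×256-d, and 16×256-d vectors described above, whatever the input resolution.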

The authors then took advantage of the variable input size and trained the network with sizes of 180×180 and 224×224 to enhance its robustness. A 4-level pyramid with scales 6×6, 3×3, 2×2, and 1×1 was used. The error rate decreased with the SPP layer alone, and it improved further with multi-size training (training with different input sizes). For testing, the image was cropped at the 4 corners and the center and flipped, producing a total of 10 views from a single image; this multi-view approach was used extensively at test time.
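The 10-view scheme (crops from the 4 corners and the center, plus their horizontal flips) is easy to sketch (a minimal version; the crop size and array layout are illustrative assumptions):

```python
import numpy as np

def ten_crop(image, size):
    """Multi-view testing: crops from the 4 corners and the centre,
    plus their horizontal flips, give 10 views of one (H, W) image."""
    h, w = image.shape[:2]
    s = size
    offsets = [(0, 0), (0, w - s), (h - s, 0), (h - s, w - s),
               ((h - s) // 2, (w - s) // 2)]
    crops = [image[y:y + s, x:x + s] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal flips
    return crops
```

At test time, the network’s predictions over the 10 views would be averaged.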

You might be thinking, “All this is fine but how did it improve Object Detection?”

The authors used the SPP mechanism for object detection in an improved approach: rather than sending the 2000 region proposals one by one through the CNN, they projected the regions onto the feature map obtained from the 5th Conv layer. To clear up the similarity between the approaches of SPP-net and Fast R-CNN: SPP-net was published in June 2014 and Fast R-CNN in April 2015. We’ll see the differences between Fast R-CNN and SPPnet at the end.

This eliminates the 2000 CNN passes needed for each image: suddenly, from 2000, it is just 1. However, Selective Search continues to be the bottleneck, because it still needs to generate the 2000 proposals. These regions are sent forward to the SPP layer for pooling into a fixed-length vector. This reduced the computation time to a great extent: the inference time for a test image on a GPU was well within 1 s, massively fast in comparison to R-CNN and on par in accuracy too.
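Projecting an image-space proposal onto the conv feature map amounts to dividing its coordinates by the network’s total sub-sampling ratio; a hedged sketch (a stride of 16 is an assumption typical of these backbones, and the floor/ceil rounding is a simplification of the paper’s mapping):

```python
import math

def project_roi(box, stride=16):
    """Project an image-space proposal (x1, y1, x2, y2) onto the conv
    feature map by the total sub-sampling ratio. Floor the top-left and
    ceil the bottom-right so the projected window covers the region."""
    x1, y1, x2, y2 = box
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))
```

For example, a 68×112-pixel proposal maps to a 5×7 window of feature-map cells at stride 16.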

On the PASCAL VOC 2007 dataset, SPPnet achieved an mAP of ~59%, higher than the ~54% of R-CNN. And on the ImageNet detection dataset, SPPnet achieved an mAP of ~35% compared to ~31% for R-CNN. The image below illustrates the difference between the pipelines of R-CNN and SPPnet (partial illustration).

With SPP-net, although there isn’t a considerable increase in mAP over R-CNN, the speed certainly increased while maintaining R-CNN’s accuracy. Coming to the drawbacks: training was still multi-stage (which the Fast R-CNN later solved), and there wasn’t a substantial jump in accuracy compared to R-CNN. Hold tight for the next post on Faster R-CNN, where the object-detection pipeline is no longer decoupled as in R-CNN, SPP-net, and Fast R-CNN: the time-consuming Selective Search is done away with and a Region Proposal Network is introduced.

Until then, I recommend looking at the SPPnet paper for more details on the image-classification aspects; the ImageNet (ILSVRC 2014) presentation slides for the SPP-net paper, linked below, are an extra resource for your learning. Do try your hand at implementing this on a small-scale dataset; I’m sure you will encounter some interesting observations.

In our next post, we will continue with our explanations of the Fast R-CNN Family with the Faster R-CNN architecture. Until then, keep learning and share your thoughts on this post.

Pranav Raikote

References:

- SPPnet Paper : https://arxiv.org/pdf/1406.4729.pdf
- SPPnet Presentation Slides : http://image-net.org/challenges/LSVRC/2014/slides/sppnet_ilsvrc2014.pdf
- AlexNet Paper : https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- Overfeat Paper : https://arxiv.org/abs/1312.6229
- ZF-Net Paper : https://cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf

In our previous article on Region-based CNNs, we covered the essential building blocks of this breakthrough architecture. If you have studied the R-CNN paper, you will find some interesting observations. Before fine-tuning, it was observed that adding more Fully Connected layers made no difference to accuracy, which meant that the Convolutional layers contributed most of the accuracy and the FC layers added little value. On the other hand, it was found that after fine-tuning, the majority of the weight changes occurred within the Fully Connected layers, which led to an increase in accuracy. The conclusion was that the Convolutional layers capture more generalizable features, while the FC layers capture the task-specific ones. We can experiment with the FC layers to find the trade-off between accuracy, model size, and inference time. The authors also explained that using an SVM over the fine-tuned CNN for detection was due to two main factors: positive examples do not emphasize precise location, and the negative examples were easy/soft negatives rather than hard negatives. Soft negatives are regions containing empty, plain backgrounds, whereas hard negatives can contain partial objects, be quite noisy, and are easily misclassified.

The R-CNN had its own set of drawbacks, mainly associated with its very slow inference time. The reasons are not one but three! First, Selective Search outputs 2000 region proposals for each image. Second, the CNN extracts features for every proposal (N*2000 forward passes, where N is the number of images). Third, it’s a fancy, complex multi-stage training pipeline of three separate models working sequentially without any shared computation: CNN, SVM, and bounding-box regressor. Due to the above reasons, R-CNN takes around ~45 s for inference on a single image, even on a GPU! This was 9 times slower than the previous best-performing model, Overfeat. These are the reasons why R-CNN wasn’t deployable in real-world or real-time scenarios.

What happened after R-CNN? Was there any improvement to the R-CNN? Could researchers come up with a better solution to this problem? Yes, they did. They tweaked the R-CNN with subtle modifications and behold – Fast R-CNN came into existence! It was accurate and faster than R-CNN by a huge margin having an inference time of just ~2s! That’s an enormous jump from ~45s. Let’s understand the improvements of Fast R-CNN over the R-CNN.

The authors of R-CNN worked on improving some of its shortcomings and found a better way to feed the region proposals, saving a lot of training and inference time. One of the major bottlenecks was generating 2000 region proposals per image, which added up to a huge number of forward passes per image; the R-CNN architecture took ~87 hours to train. The authors found a way to reduce this computational overhead: how about generating a single feature map per image and projecting the Regions of Interest (RoIs) onto that feature map? This avoids the tedious and computationally intensive task of generating feature maps for each of the generated region proposals. So, the modified architecture looks like the image shown below.

Each image is fed to the CNN (here VGG-16, with its final pooling and FC layers removed), which outputs a feature map for the image. Selective Search generates the 2000 region proposals, which are projected onto the CNN-generated feature map. RoI Pooling is a way to convert the features in a projected region of any size into a fixed window size, making sure the output dimension is always constant. Note that a form of warping does take place here too, to an extent. The RoI Pooling layer always outputs fixed-length feature vectors, and further processing happens on these vectors. The model then branches into 2 output layers: an object classification layer and a bounding-box regression layer. The softmax layer for classification consists of K+1 classes, the +1 being a background class, and outputs a discrete probability distribution for each RoI. The bounding-box regressor predicts the offsets from the original RoI for the K classes.
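A minimal NumPy sketch of RoI Pooling as described above: the projected RoI is divided into a fixed grid and each bin is max-pooled, so every RoI yields the same output shape (the guard against empty bins is my addition, not part of the original formulation):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    """Max-pool a projected RoI (x1, y1, x2, y2, in feature-map cells)
    of a (C, H, W) feature map into a fixed (C, out_size, out_size) grid."""
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Guard against empty bins when the RoI is smaller than the grid
            y0, y1b = ys[i], max(ys[i + 1], ys[i] + 1)
            x0, x1b = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = region[:, y0:y1b, x0:x1b].max(axis=(1, 2))
    return out
```

Whatever the RoI’s size, the output is always C × 7 × 7, which is what lets the FC layers that follow have a fixed input length.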

One thing to note here is that training features a combined learning procedure: fine-tuning the CNN and learning classification plus bounding-box regression together. The loss function used for the localization task is a smooth L1 loss, and the final loss is a combination of the classification and localization losses. Hence, the network is backpropagated with a single loss, which solves the complex multi-stage, non-sharable computation problem. For an in-depth understanding of the loss functions and the various training parameters, I suggest a detailed read of the Fast R-CNN paper.

Fast R-CNN lived up to its name and reduced training time to around 9.5 hours, but there wasn’t much improvement in mAP, which stood near 69%. At test time, however, Fast R-CNN is up to 45 times faster, which is a huge improvement.

But, is there still scope for improvement? Yes, there is! Remember the Selective Search needs to generate up to 2000 Region Proposals? That’s a bottleneck and the network is still not yet unified. Well, this bottleneck too was solved in the next iteration of models in the R-CNN Family, which is the Faster R-CNN model. There was one more architecture which came a wee bit before the Faster R-CNN model – SPPnet, which we will discuss in our next post and continue with the Faster R-CNN architecture in the subsequent posts. Until then keep learning, and do share your thoughts on this post.

Pranav Raikote

- R-CNN Paper : https://arxiv.org/pdf/1311.2524.pdf
- Fast R-CNN Paper : https://arxiv.org/pdf/1504.08083.pdf
- Slide 71 onwards (Fast R- CNN) : http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
- Fast R-CNN : https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection#fast-rcnn

R-CNN, short for Region-based Convolutional Neural Network, was first introduced in 2014 and has over 15,000 citations today. It is one of the fundamental breakthroughs in Object Detection and performed far better than any other implementation at the time. It subtly puts together several distinct stages. Let’s look at the overall architecture and then understand the different parts of R-CNN in detail. Given below is the high-level architecture of R-CNN, whose sub-parts are: generating region proposals, extracting features using a pre-trained network, a linear SVM for identifying the class, and a bounding-box regressor for localization.

Coming to the initial step of the pipeline, extracting region proposals, there are various techniques available for this task, such as Sliding Windows, Colour Contrast, Edge Boxes, Super-Pixel Straddling, and Selective Search. Extracting region proposals is the process of sampling cropped regions of the image of arbitrary size, which may or may not contain an object. The Selective Search algorithm was used in R-CNN, as it was found to be more effective, and it outputs up to 2000 category-independent regions per image. Refer to this for learning about the Selective Search algorithm in depth. Selective Search is a class-agnostic detector, often used as a preprocessor to produce a bunch of interesting bounding boxes that have a high chance of containing an object. Since it is class-agnostic, we need a classifier at the end to determine the actual class of the object inside each output bounding box. One important preprocessing step is warping each region to the fixed, predefined input size of the CNN, which is its innate requirement. The images below give us a glimpse of Selective Search and the proposal boxes generated.
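The warping step can be sketched with simple nearest-neighbour resampling (a simplification: the paper uses anisotropic scaling with some context padding around the box; 227×227 is AlexNet’s input size):

```python
import numpy as np

def warp_region(image, box, size=227):
    """Warp a cropped region proposal (x1, y1, x2, y2) to the CNN's fixed
    input size, ignoring the original aspect ratio, via nearest-neighbour
    index lookups."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    rows = (np.arange(size) * h / size).astype(int)
    cols = (np.arange(size) * w / size).astype(int)
    return crop[rows][:, cols]
```

Because width and height are stretched independently, a tall thin proposal comes out visibly distorted, which is exactly the distortion the article mentions as a source of error.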

Next up is the feature extraction phase, where the authors used AlexNet (popular at the time) as a pretrained network to generate a 4096-dimensional vector for each of the 2000 region proposals. We can use the pre-trained AlexNet with its last softmax layer removed to generate the feature vectors, and then fine-tune the CNN on our warped images and the specific target classes. Proposals with sufficiently high IoU (Intersection over Union) against a ground-truth box are labeled positive; the rest are labeled negative (for all classes). So, the output of this feature extraction phase is a 4096-dimensional feature vector per proposal.

The vectors generated are used to train a Linear SVM to classify the object. Here we need an individual SVM for each object class we are training for. For each feature vector we get n outputs, where n is the total number of classes under consideration, and each output is a confidence score. Based on the highest confidence score we can infer the object class(es) present in a particular image. Given below is a graphical representation of the feature-vector and SVM computation matrices.
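The per-class SVM scoring on the 4096-dimensional vectors reduces to one matrix multiplication. In the sketch below the weights are random placeholders rather than trained SVMs; 20 classes and 2000 proposals are just the R-CNN-style dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, feat_dim = 20, 4096           # one linear SVM per class

# Weight matrix: one 4096-dim weight vector (plus bias) per class SVM
W = rng.standard_normal((n_classes, feat_dim))
b = rng.standard_normal(n_classes)

features = rng.standard_normal((2000, feat_dim))  # one vector per proposal

# Confidence score of every proposal for every class in a single matmul
scores = features @ W.T + b              # shape (2000, n_classes)
best_class = scores.argmax(axis=1)       # highest-confidence class per proposal
print(scores.shape, best_class.shape)
```

This is why the stage is often drawn as a matrix product: 2000 feature rows against n per-class weight columns.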

The final stage is the localization aspect of Object Detection. A regression model with an L1/L2 loss function is attached to predict the bounding box coordinates. This Bounding Box Regression is optional and was added later to the original R-CNN implementation to increase localization accuracy. The reason it was tried at a later stage is that the region proposals are already a kind of bounding box. We also need to supply the ground-truth bounding box coordinates while training this stage. The low accuracy initially observed (~45%) was due to the warped images, which contributed to the loss as they appeared distorted and stretched. To counter this, the authors fine-tuned the network using an (N+1)-way softmax output layer (N classes plus a background class), which increased the accuracy by 10%. One more problem encountered was that the model might predict multiple bounding boxes for a single object, say around 5 boxes for one object in the image. Here a greedy approach, iteratively selecting the box with the highest confidence score and discarding boxes that overlap it above an IoU threshold (Non-Maximum Suppression), resolves the overlapping boxes so that a single best bounding box is predicted.
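The greedy suppression step is Non-Maximum Suppression (NMS). A minimal sketch, with hand-made example boxes in (x1, y1, x2, y2) format:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Four overlapping detections of one object plus one far-away detection:
# only one box per object survives.
boxes = [(10, 10, 60, 60), (12, 12, 62, 62), (11, 9, 61, 59),
         (9, 11, 59, 61), (200, 200, 250, 250)]
scores = [0.9, 0.8, 0.7, 0.6, 0.85]
print(nms(boxes, scores))  # [0, 4]
```

The IoU threshold trades off between suppressing duplicates and keeping genuinely distinct, adjacent objects.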

It was then experimentally found that the Bounding Box Regressor helped bring the predicted bounding boxes closer to the ground-truth coordinates. This led to a jump of at least 10% in accuracy, and later, when the VGG network was used in place of AlexNet, the reported accuracy was close to 66%. The R-CNN achieved a mAP of 54% on PASCAL VOC 2010 and 31% on the ImageNet detection dataset.

And finally, we are through! We learned the R-CNN Architecture in detail and understood the various stages and the techniques employed to solve the problems faced during the development of this model. Find here the R-CNN Paper. I’d recommend reading the full paper to get an exhaustive in-depth understanding of R-CNN and understand the various experiments and observations which are really really interesting!

In the next post, we will revisit R-CNN's drawbacks and understand how they were overcome, giving rise to faster Object Detection architectures. Until then, share your thoughts on this post and think about why R-CNN had major drawbacks and wasn't adequate for real-world deployment.

Pranav Raikote

- R-CNN Paper : https://arxiv.org/abs/1311.2524
- Selective Search Paper : http://www.huppelen.nl/publications/selectiveSearchDraft.pdf
- Refer Slides – Other Computer Vision Tasks (From Slide 17 & 53) : http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture11.pdf
- AlexNet Paper : https://papers.nips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
- VGG-16 Paper : https://arxiv.org/abs/1409.1556
- R-CNN : https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection#rcnn

First comes Image Classification, where the task is to assign a class label to a given image. The next level is Object Localization, in which the goal is to detect the presence of objects and put a bounding box around each to depict its location. A bounding box is a 2D colored rectangle drawn on the image showing the location of the detected object. We can see multiple boxes with labels in the image shown above, detecting a dog, a person, traffic lights, etc. Building on top of this is Object Detection, where we identify the class/label of each bounding box with a confidence score. Object Detection can be thought of as having two levels: Single-Object Detection (where the detector is trained to detect one particular object) and Multi-Object Detection (trained to detect a multitude of objects in a single image). The highest level is Object Segmentation or Semantic Segmentation, where we mark the pixels of every object rather than drawing a bounding box. Here, overlapping objects are labeled quite accurately.

There are two main Object Detection models: Multi-Stage Object detector and Single Stage Object detector. The Region-based Convolutional Neural Networks (R-CNN) family is a Multi-Stage type as it involves more than one stage (two sub-stages). The SSD (Single Shot Detector) and the YOLO (You Only Look Once) families are of the type Single Stage detector as they classify and give the bounding box per image in a single network or single stage. Let’s briefly inspect these popular architectures.

The R-CNN consists of three modules: Region Proposal, Feature Extractor, and Classifier. The Region Proposal module generates approximate bounding boxes (region proposals); the selective search algorithm is employed to come up with 2000 region proposals per image. The features of those region proposals are extracted using a deep Convolutional Neural Network. Finally, the features are classified using a linear SVM classifier. R-CNN takes roughly 49 seconds per image, which is far too slow for real-world deployment. There are faster and more efficient versions, namely Fast R-CNN and Faster R-CNN, which reduce the detection time from ~49 seconds with R-CNN to ~0.2 seconds with Faster R-CNN.

The Single Shot Detector comprises only two modules: extraction of feature maps in the first module, and applying convolutional filters for object detection in the second. (Convolutional filters are the key building blocks of any CNN; they detect image contours and output the corresponding features.) The SSD applies 3×3 convolution filters at every cell to generate the predictions. For each default box, the network predicts 25 values: 21 class scores (20 classes plus background, in the PASCAL VOC setting) and 4 bounding box offsets. In a typical SSD, a modified VGG16 is used as the convolutional backbone, with six extra auxiliary layers stacked on top. The network makes a total of ~8700 predictions across its prediction layers. We get a higher frame rate of 22-49 FPS, which is quite suitable for real-time deployment. Given below is a pictorial representation of the SSD architecture.
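The arithmetic behind the "~8700 predictions" can be checked directly. Assuming the SSD300/PASCAL VOC configuration from the paper (feature-map side lengths and default boxes per cell), the count works out to 8732:

```python
# SSD300 prediction layers: (feature-map side, default boxes per cell),
# values taken from the SSD paper's VOC setting.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
num_classes = 21                      # 20 VOC classes + background

total_boxes = sum(side * side * k for side, k in feature_maps)
channels_per_box = num_classes + 4    # 21 class scores + 4 box offsets = 25
print(total_boxes)  # 8732 default-box predictions per image
```

Most of the 8732 boxes come from the largest (38×38) feature map, which is why SSD leans on NMS to prune duplicates.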

The YOLO approach involves a single neural network, trained end to end, that takes an image as input and outputs both bounding boxes and a class label for each box. Its predictive accuracy is lower than that of two-stage detectors, but it delivers a throughput of 44-155 FPS depending on the model variant. The input image is split into a grid of cells, where each cell is responsible for predicting a bounding box if the center of an object's bounding box falls within it. Each cell predicts a bounding box with x, y coordinates, height, width, and a confidence score. There are newer versions of YOLO, namely YOLOv2, YOLOv3, and YOLOv4, which perform better.
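A quick sanity check of the YOLOv1 output tensor, assuming the paper's PASCAL VOC configuration (7×7 grid, 2 boxes per cell, 20 classes):

```python
# YOLOv1 output: an S x S grid, B boxes per cell (x, y, w, h, confidence),
# plus C class probabilities shared per cell.
S, B, C = 7, 2, 20                      # values from the YOLO paper (VOC)
per_cell = B * 5 + C                    # 2*5 + 20 = 30 numbers per cell
output_shape = (S, S, per_cell)
print(output_shape)  # (7, 7, 30)
```

The whole image is thus described by a single 7×7×30 tensor produced in one forward pass, which is where the speed advantage comes from.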

And Voila! We have had a fast tour of Object Detection and various methods of approaching the Object Detection problem. Sit tight and look forward to the following post for an in-depth review of each of the algorithms listed above. Until then, continue learning and share your thoughts on this article.

Pranav Raikote

- R-CNN Paper : https://arxiv.org/pdf/1311.2524.pdf
- Fast R-CNN Paper : https://arxiv.org/pdf/1504.08083.pdf
- Faster R-CNN Paper : https://arxiv.org/pdf/1506.01497.pdf
- SSD Paper : https://arxiv.org/pdf/1512.02325.pdf
- YOLO Paper: https://arxiv.org/pdf/1506.02640v5.pdf
- Object Localization & Detection : https://leonardoaraujosantos.gitbook.io/artificial-inteligence/machine_learning/deep_learning/object_localization_and_detection
- Object Detection : https://livebook.manning.com/book/deep-learning-for-vision-systems/chapter-7/v-8

Almost all neural networks, especially in the Computer Vision / Image Classification domain, rely heavily on Batch Normalization for training deep networks. Be it smoothing the loss, reducing covariate shift, or the regularization effect, Batch Normalization has always given an unprecedented advantage in training a neural net, until now. That's about to change! The authors of NF-Net showed that we can train deep residual networks without Batch Normalization by replacing it with a few other techniques, leading to faster training and better accuracy than Batch Normalization-enabled networks. NF-Nets were faster to train and could use a batch size of up to 4096! But before diving into the details of the new alternatives and techniques, let's first set the context with a small walk-through of Batch Normalization.

Just as a quick recap, Batch Normalization is a method that enables training very deep networks like ResNet by standardizing the inputs to each layer for every mini-batch. Researchers had observed a problem called internal covariate shift, the change in the distribution of inputs between layers, which made the network look like it was training towards a moving target. This is because neural nets are quite sensitive to the initial weights and the training algorithm. Batch Normalization mitigated this to a significant extent by normalizing the inputs using the mini-batch mean and variance, which smoothens the loss landscape and stabilizes training.

The below image shows the various steps and formulae in the Batch Normalization procedure for each mini-batch.
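A minimal NumPy sketch of the per-feature standardization step (the learnable scale γ and shift β are set to identity here, and the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.
    x: (batch, features); gamma, beta: learnable per-feature parameters."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # shifted, scaled mini-batch
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6))  # ~0 for every feature
```

Whatever shift and scale the incoming activations carry, each feature leaves the layer with roughly zero mean and unit variance, which is exactly the stabilizing effect described above.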

- It is hugely successful in **down-scaling the hidden activations** when training deep ResNets (ResNet, ResNeXt, etc.) consisting of 1000s of layers. How? Batch Normalization placed on the residual branch scales down the activation outputs there, inducing a slight bias towards the skip branch. This ensured that initial training was stable and led to efficient optimization.
- **It eliminates the mean-shift effect** that arises when functions like ReLU & GELU output non-zero-mean activations. The main issue was that, as the network went deeper, the mean activation got larger and more positive, which led the network to predict the same label for all training samples and made training unstable. Batch Normalization solves this by keeping the mean activation at zero across all layers.
- It has a **good regularizing effect** and can replace Dropout as a regularizer while training neural nets. Networks trained with Batch Normalization do not overfit easily and generalize well for any given data sample, mainly thanks to the noise in the batch statistics. Also, the validation accuracy can be improved by tuning the batch size.
- It can **train neural nets with bigger batch sizes and larger learning rates**. Since the loss landscape is smooth, larger and stable learning rates can be used for training (this is less effective for smaller batch sizes). It also achieves the same accuracy in fewer steps than a non-batch-normalized neural net, thereby improving training speed.

- It can be **computationally intensive** in some networks due to the calculation of the mean and scaling parameters and storing them for the back-prop step.
- There can be **discrepancies between the behavior of the network during training and testing times.** While training, the network may have adapted to the particular batch-wise setup, making it dependent on batch statistics, so it might not perform as well when a single example is provided at inference.
- Batch Normalization **breaks the independence between examples within a batch.** This means the examples selected in a batch matter, which leads to two further issues, **batch size matters** and inefficient **distributed training**, that can result in the network cheating the loss function.
  - Batch size matters because in a small batch the mean estimate is noisy, whereas in a bigger batch the approximation is more reliable; larger batches have been observed to give stable and efficient training. **Also, the performance of Batch Normalized networks can degrade if the stats have a large variance while training.**
  - As for cheating the loss function, difficult distributed training, and replicating results on different hardware: when we distribute training across parallel streams, each stream receives a portion, or shard, of a batch and applies a forward pass. The discrepancy arises when there is no communication between the Batch Normalization layers, i.e. all streams calculate the mean and variance parameters independently. The parameters then hold only for each shard rather than for the entire batch, which leads to **cheating** the loss function.

From the above discussion of pros and cons, we understand that although Batch Normalization has been instrumental in training deep networks, it has major disadvantages, and the authors put forward the idea of heading in a new direction: neural nets free of Batch Normalization. The way to achieve this was to replace batch norm by suppressing the hidden activations on the residual branch. The authors of this [Paper] implemented a normalizer-free ResNet by suppressing the residual branch at initialization and using **Scaled Weight Standardization**. Weight Standardization controls the first and second moments of the weights of each output channel individually in convolutional layers, standardizing the weights in a differentiable way that aims to normalize the gradients during back-propagation.

These nets (**NF-ResNet**) were able to match the accuracy of Batch Normalized ResNets but struggled with larger batch sizes and failed to match the current state-of-the-art EfficientNet. **NF-Nets are built on top of this research work.**

- The authors propose a new method, **Adaptive Gradient Clipping (AGC)**, which clips the gradient based on the unit-wise *ratio of gradient norms to parameter norms*, allowing the training of NF-Nets with larger batches and stronger data augmentations.
- It introduces a **family of Normalizer-Free ResNets**, NF-Nets, which surpass the results of the previous state-of-the-art architecture, EfficientNet. The largest NF-Net model achieved a top-1 accuracy of 86.5% (a new state of the art) without the use of extra data!
- It shows that **NF-Nets outperform Batch Normalized networks** in terms of validation accuracy when fine-tuned on ImageNet; the top-1 accuracy after fine-tuning is 89.2%.

Before going into Adaptive Gradient Clipping, or AGC for short, what is gradient clipping? Gradient Clipping is a method to limit huge changes in gradient values, either positive or negative. In simple terms, we don't want the gradient to take big jumps while searching for the global minimum, so we simply clip the gradient value when it is too large. But we also have to accommodate the scenario where the gradient must be large enough to escape a local minimum or correct its course while traversing the loss landscape. If a large gradient points in a good direction, similar gradients will appear again; if it is a bad gradient, we want to limit its impact.

The authors hypothesize that gradient clipping should enable the training of NF-Nets with large batch sizes and larger learning rates: gradient clipping has previously been used in language modelling [Paper] to stabilize training, and it also allows the use of larger learning rates, thereby accelerating training. The standard gradient clipping [Paper] is given by,

where G is the gradient vector, G = ∂L/∂θ, L denotes the loss, and θ denotes a vector with all model parameters. This clipping is done before updating θ. The lambda (λ) is the clipping-threshold hyperparameter that must be tuned. The authors observed that this hyperparameter is highly influential (training stability was extremely sensitive to the chosen clipping threshold) and hence required very fine-grained tuning. To counter this, they introduced **Adaptive Gradient Clipping (AGC)**.
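A minimal sketch of standard clipping by the global gradient norm, following the formula above (λ and the example gradient are arbitrary illustrative values):

```python
import numpy as np

def clip_by_global_norm(grad, lam):
    """Standard gradient clipping: if ||G|| exceeds lambda,
    rescale G so that its norm equals lambda; otherwise leave it alone."""
    norm = np.linalg.norm(grad)
    return grad * (lam / norm) if norm > lam else grad

g = np.array([3.0, 4.0])                 # ||g|| = 5
clipped = clip_by_global_norm(g, lam=1.0)
print(clipped, np.linalg.norm(clipped))  # direction preserved, norm capped at 1
```

Note the fixed threshold λ takes no account of how large the weights being updated are, which is exactly the shortcoming AGC addresses.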

The gist of AGC is that it clips the gradient with respect to the ratio of the norm of the gradients G^*l* to the norm of the weights W^*l* for layer *l* (how large the gradient is relative to the weight it acts upon). This gives a measure of how much the gradient step will change the original weights. Specifically, each unit *i* of the gradient of the *l*-th layer, G(*l*)(*i*), is given by,

where epsilon = 10^-3 prevents zero-initialized parameters from always having their gradients clipped to zero. Using AGC enabled training neural networks stably with batch sizes of 4096, which is quite massive for a batch. λ, the clipping parameter, is set depending upon the optimizer, learning rate, and batch size. The image below depicts the scaling of NF-Nets for larger batch sizes.

Empirically, it was found that **a lower clipping threshold works well for larger batch sizes.** With a large batch size the batch statistics are not that noisy, so the clipping threshold must be set low, otherwise training collapses. Likewise, if the batch size is very small, there is more noise in the batch statistics and we can get away with a higher clipping threshold.
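A hedged NumPy sketch of AGC for a single weight matrix, treating each output row as one "unit" (the real implementation applies this per output channel of every convolution, and λ = 0.01 here is just an illustrative value):

```python
import numpy as np

def agc(weights, grads, lam=0.01, eps=1e-3):
    """Adaptive Gradient Clipping: rescale a unit's gradient whenever the
    ratio of its gradient norm to its weight norm exceeds lambda."""
    # Floor the weight norm at eps so zero-initialized units are not
    # permanently clipped to zero gradients.
    w_norm = np.maximum(np.linalg.norm(weights, axis=1, keepdims=True), eps)
    g_norm = np.linalg.norm(grads, axis=1, keepdims=True)
    max_norm = lam * w_norm
    # Rescale only the rows whose gradient norm exceeds lambda * weight norm
    scale = np.where(g_norm > max_norm, max_norm / g_norm, 1.0)
    return grads * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))          # one row per output unit
G = rng.standard_normal((4, 16)) * 100.0  # deliberately huge gradients
G_clipped = agc(W, G)
ratios = np.linalg.norm(G_clipped, axis=1) / np.linalg.norm(W, axis=1)
print(ratios)  # every unit-wise ratio is now <= lambda = 0.01
```

Unlike the global-norm version, the cap here adapts to each unit: large weights tolerate large gradients, small weights do not.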

Coming to the architecture, NF-Net is a modified version of SE-ResNeXt-D [Paper]. The model has an initial "stem" comprised of a 3×3 stride-2 convolution with 16 channels, two 3×3 stride-1 convolutions with 32 and 64 channels respectively, and a final 3×3 stride-2 convolution with 128 channels. The activation function used here is GELU [Paper], which performs as well as ReLU & SiLU.

The Residual stages consist of two types of blocks, starting with a transitional block and followed by the non-transitional block. All blocks employ the pre-activation ResNe(X)t bottleneck pattern with an added 3×3 grouped convolution inside the bottleneck. Following the last 1×1 convolution is a Squeeze & Excite layer which globally average pools the activation, applies two linear layers with an interleaved scaled non-linearity to the pooled activation, applies a sigmoid, then rescales the tensor channel-wise by twice the value of this sigmoid.

After all of the residual stages, we apply a 1×1 expansion convolution that doubles the channel count. This layer is primarily helpful when using very thin networks, as it is typically desirable to have the dimensionality of the final activation vectors (which the classifier layer receives) be greater than or equal to the number of classes. Coming to the final layer, it outputs a 1000-way class vector with learnable biases. This layer has its weights initialized to 0.01 and not 0.

The **important** feature of NF-Nets is that they are **"Normalizer Free."** The input to the main path of the residual block is multiplied by 1/β, where β is the predicted standard deviation of the inputs to that block at initialization, and the output of the block is multiplied by a scalar hyperparameter α, typically set to a small value like α = 0.2. These scalars α and β are instrumental in achieving the normalizer-free implementation. The formulae are given below,
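The scaling rule can be sketched in a few lines, with ReLU as a stand-in for the full residual branch f(·); α, β, and the input here are illustrative values, not the paper's exact configuration:

```python
import numpy as np

def nf_residual_block(h, f, alpha=0.2, beta=1.0):
    """Normalizer-free residual pattern: h_out = h + alpha * f(h / beta).
    beta is the predicted standard deviation of the inputs at this block;
    alpha controls how fast the variance grows from block to block."""
    return h + alpha * f(h / beta)

relu = lambda x: np.maximum(x, 0.0)   # stand-in for the residual branch

rng = np.random.default_rng(0)
h = rng.standard_normal(10_000)       # unit-variance input, so beta = 1 here
out = nf_residual_block(h, relu, alpha=0.2, beta=1.0)
print(h.var().round(3), out.var().round(3))  # variance grows slightly per block
```

Because α is small, the variance grows slowly and predictably across blocks, which is what lets the 1/β downscaling at the next block stand in for explicit normalization.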

The model and its configurations are quite complex and going through the paper will help if you’re replicating/implementing the NF-Nets. The experiments also are a tad lengthy and to keep it short and crisp in this blog, I have not included the experiments and ablation studies.

As we see, NF-Nets outperform their Batch Normalized counterparts. **The NF-Net-F5 model achieved a top-1 validation accuracy of 86%, improving over the previous state-of-the-art results. And behold, we have the NF-Net as the new state-of-the-art network**, whereas the NF-Net-F1 model matches the EfficientNet-B7’s score of 84.7% (all depicted in the below table).

- The training examples are still implicitly dependent on the batch if we observe how the **implementation** is done: clipping happens after the averaging operations for that batch.
- Different behavior at training and test time was quoted as one of the problems of Batch Normalization, yet the implementation here uses Dropout, which also behaves differently at training and test time.

To conclude, there were a lot of things going on in this paper, but few were very significant for the successful development of Normalizer Free Neural Nets. The introduced NF-Nets surpassed the performance of the latest state-of-the-art for Image Classification (without using extra data) and was also faster to train. It was also shown that the family of NF-Nets is better-suited for fine-tuning on large datasets than the batch normalized variants.

Pranav Raikote

- NF-Net: https://arxiv.org/pdf/2102.06171v1.pdf
- EfficientNet: https://arxiv.org/pdf/1905.11946.pdf
- Batch Normalization: https://arxiv.org/pdf/1502.03167.pdf
- ImageNet: http://www.image-net.org/
- ResNet: https://arxiv.org/pdf/1512.03385.pdf
- ReLU: https://arxiv.org/pdf/1811.03378.pdf
- GELU: https://arxiv.org/pdf/1606.08415.pdf
- Unnormalized ResNet: https://arxiv.org/pdf/2101.08692.pdf
- Regularizing and Optimizing LSTM Language Models: https://arxiv.org/pdf/1708.02182.pdf
- Gradient Clipping & Regularization: https://arxiv.org/pdf/1211.5063.pdf
- SE-ResNeXt-D (Squeeze & Excitation Networks): https://arxiv.org/pdf/1709.01507.pdf

This work, 'Castle in the Sky', proposes a vision-based method for video sky replacement and harmonization that can automatically generate realistic and dramatic sky backgrounds in videos with controllable styles. The method runs in real time and is free of user interaction; the authors decompose this artistic creation process into a few proxy tasks: sky matting, motion estimation, and image blending.

Take a look at this video. The authors changed the sky and put a spaceship in there, which is already amazing. The spaceship is not stationary but moves in harmony with the other objects in the video. Moreover, since the sky has changed and the lighting situation has changed with it, the colors of the remainder of the image also have to change.

We can do so much more with this. For instance, put a castle in the sky, make it an extra planet. Let’s see if it is able to recolor the image after changing its surroundings. How well does it do when the background is changed to a dynamic one like a thunderstorm? This new method handles this case as well. Click here for the video.

So before we look under the hood to see how all this is done, let's first list our expectations:

- We expect that it has to know what pixels to change to load a different sky model.
- It should know how the image is changing and rotating over time.
- Some recoloring also has to take place.

Now let’s have a look and see how the model architecture is able to fulfill our expectations.

- It has a sky matting network and this network finds the parts of the image where the sky is, so this network does the work to fulfill our Expectation-1.
- For Expectation-2, there is the motion estimator that computes the optical flow of the image, this tracks the movement of the sky over time.
- And there is the recoloring module as well to do the recoloring (if required).

So this method can do not only sky replacement; detailed weather and lighting synthesis is also possible.

By looking at the results, listing our expectations, and then examining the architecture of the neural network, we can evaluate the research work with some confidence.

Shubham Bindal

- Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos: https://arxiv.org/pdf/2010.11800.pdf