GluonCV, Faster R-CNN, normalization layers

If I understand it correctly, the GluonCV Faster R-CNN model uses a special ResNet-50 backbone that “denormalizes” the input image. I assume this was added so that the pretrained weights from the MXNet model zoo could be reused.

However, I was wondering whether this step could be left out entirely by simply not normalizing the image in the dataloader.

@ifeherva It seems you are correct that it is doing the inverse transformation, with a *255 factor missing.
I am not entirely sure what the reasoning is behind proceeding that way rather than just multiplying the initial image by 255.
@Hang_Zhang @zhreshold could you advise on the reason behind this hard-coded rescale layer?
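A minimal pure-Python sketch of the round trip discussed above, assuming the standard ImageNet mean/std values that GluonCV transforms commonly use (the constants and helper names here are illustrative, not GluonCV functions). Undoing only the mean/std step leaves each pixel at raw/255, so the missing *255 (or simply multiplying an un-normalized [0, 1] image by 255) is what recovers the original 0-255 scale.

```python
# Assumed ImageNet per-channel statistics (RGB order); not taken from the thread.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize(pixel):
    """Dataloader side (hypothetical helper): raw 0-255 RGB -> [0, 1] -> mean/std normalized."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(pixel, MEAN, STD))


def denormalize(pixel):
    """Inverse of the mean/std step only; the division by 255 is NOT undone."""
    return tuple(v * s + m for v, m, s in zip(pixel, MEAN, STD))


def denormalize_to_raw(pixel):
    """Inverse transform including the missing *255 factor."""
    return tuple(255.0 * c for c in denormalize(pixel))


raw = (124.0, 116.0, 104.0)
# Round trip without *255: equivalent to feeding raw / 255 (no Normalize step at all).
print([round(c, 4) for c in denormalize(normalize(raw))])
# Round trip with *255: back to the original 0-255 pixel values.
print([round(c, 4) for c in denormalize_to_raw(normalize(raw))])
```

In other words, normalize-then-denormalize is numerically the same as dropping the mean/std step from the dataloader, up to floating-point error.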


I tried plugging in the resnet50v2 model from the MXNet model zoo; it worked, but the performance was much worse.

The reason is fairly simple: in our experiments we observed that the input scale has a large effect on performance. Whether this is due to the initialization scale or to the pretrained model is still unknown.

Until we figure out a generic solution, we use these hard-coded scaling layers for consistency throughout the gluon-cv package.
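To give a sense of how different the common input scales are, here is a small pure-Python sketch, assuming the standard ImageNet mean/std (an assumption, not something stated in the thread). It only shows the numeric ranges a backbone's first convolution would see under each convention; it does not settle whether the root cause is initialization or the pretrained weights.

```python
# Assumed ImageNet per-channel statistics (RGB order).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def input_range(convention):
    """Return per-channel (min, max) of the network input for a given convention.

    The convention names are hypothetical labels used only for this illustration.
    """
    lo, hi = 0.0, 255.0
    if convention == "raw_0_255":   # pixels fed as-is
        return [(lo, hi)] * 3
    if convention == "unit_0_1":    # divided by 255, no mean/std
        return [(lo / 255, hi / 255)] * 3
    if convention == "mean_std":    # divided by 255, then mean/std normalized
        return [((lo / 255 - m) / s, (hi / 255 - m) / s) for m, s in zip(MEAN, STD)]
    raise ValueError(f"unknown convention: {convention}")


for name in ("raw_0_255", "unit_0_1", "mean_std"):
    print(name, [(round(a, 2), round(b, 2)) for a, b in input_range(name)])
```

Because a convolution without bias is linear in its input, a filter tuned for one of these ranges sees responses scaled by up to about 255x under another, which is one plausible reason a mismatched convention hurts a pretrained backbone.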


Thanks for the reply. If I use the resnet50v2 model without the rescaling, I get considerably worse recall on my validation set.
See code below:

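import mxnet as mx
from gluoncv import model_zoo  # assumed: the module that provides get_faster_rcnn

def build_faster_rcnn(base_net, dataset, pretrained_base=True):  # hypothetical wrapper name
    # Plain MXNet model-zoo backbone (e.g. 'resnet50_v2'), without GluonCV's rescale layer
    base_network = mx.gluon.model_zoo.vision.get_model(base_net, pretrained=pretrained_base)
    features = base_network.features[:8]        # input BN, stem, stages 1-3 (stride 16)
    top_features = base_network.features[8:11]  # stage 4 + final BN + ReLU
    train_patterns = '|'.join(['.*dense', '.*rpn', '.*stage(2|3|4)_conv'])
    return model_zoo.get_faster_rcnn(base_net, features, top_features, scales=(2, 4, 8, 16, 32),
                                     ratios=(0.5, 1, 2), classes=['BG', 'CLASS1'], dataset=dataset,
                                     roi_mode='align', roi_size=(14, 14), stride=16,
                                     rpn_channel=1024, train_patterns=train_patterns,
                                     pretrained=False)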

On the other hand, it is at least 3 times faster at inference time. Where does this speedup come from?

Fewer classes, and therefore less non-maximum suppression (NMS) time.

We use per-class NMS for the best recall, so the NMS cost is O(N), where N is the number of foreground classes.
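A minimal pure-Python sketch of greedy per-class NMS, showing why the cost grows linearly with the number of foreground classes: the outer loop runs NMS once per class. This is an illustration, not GluonCV's implementation. For simplicity all classes share one set of boxes here, whereas Faster R-CNN usually regresses a separate box per class. With one foreground class (the BG/CLASS1 setup above) the loop body runs once; a 20-class VOC model runs it 20 times.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh):
    """Greedy NMS for a single class; returns indices of kept boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep


def per_class_nms(boxes, class_scores, thresh=0.5):
    """Run NMS independently for every foreground class.

    class_scores[c][i] is the score of box i for class c; one NMS pass per class.
    """
    return {c: nms(boxes, scores, thresh) for c, scores in enumerate(class_scores)}


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
class_scores = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.6]]
print(per_class_nms(boxes, class_scores))  # {0: [0, 2], 1: [1, 2]}
```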

I turned off normalization in the dataloader and used the MXNet resnet50v2 model. This gave better recall, though still not as good as the GluonCV ResNet-50. Still, my model is 2-3 times faster at inference time on GPU. The only difference I see is this:

self.layer0.add(nn.BatchNorm(scale=False, epsilon=2e-5, use_global_stats=True))

vs

self.features.add(nn.BatchNorm(scale=False, center=False))

Could use_global_stats be responsible for the speed difference?

Yes, we are investigating the poor performance of BatchNorm when it runs without cuDNN.
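For reference: in inference mode, MXNet's BatchNorm already normalizes with the moving (global) statistics, so both layers quoted above compute a per-channel scale-and-shift at prediction time; use_global_stats=True additionally freezes those statistics during training. That fits the answer that the speed gap comes from which kernel runs rather than from different math. The pure-Python sketch below (hypothetical helper names, not the MXNet implementation) shows that inference-time BatchNorm folds into y = scale * x + shift; scale=False corresponds to gamma fixed at 1 and center=False to beta fixed at 0.

```python
import math


def batchnorm_inference(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm on one channel value using moving (global) statistics."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta


def fold_to_affine(running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Precompute (scale, shift) so that BatchNorm becomes y = scale * x + shift."""
    scale = gamma / math.sqrt(running_var + eps)
    shift = beta - running_mean * scale
    return scale, shift


# scale=False -> gamma stays 1.0; center=False -> beta stays 0.0.
# eps=2e-5 mirrors the epsilon in the GluonCV layer quoted above.
mean, var = 0.5, 4.0
s, b = fold_to_affine(mean, var, eps=2e-5)
for x in (-1.0, 0.0, 3.0):
    print(x, batchnorm_inference(x, mean, var, eps=2e-5), s * x + b)
```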