Question about batch normalization


#1

I find that most implementations that use batch normalization set

use_global_stats=True

such as DCN.

In the documentation for mx.sym.BatchNorm, use_global_stats is described as follows:

If use_global_stats is set to be true, 
then moving_mean and moving_var are used instead of data_mean and data_var to compute the output. 
It is often used during inference.

Does this mean it should be set to False during training and True during inference? If so, why does the code set use_global_stats=True during training? If I have made a mistake, I would appreciate a correction.


#2

Hi @weleen,

So I think of BatchNorm as being broken down into 2 steps: the first step is normalizing the input (i.e. zero mean and unit variance), and the second step is translating and scaling the distribution as required, using the learnt parameters beta and gamma respectively. When normalizing the input as part of the first step we have to calculate the mean and the variance, so that we know what to subtract and what to divide the input values by.
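To make the two steps concrete, here is a minimal sketch in plain Python for a single channel. It is not MXNet's implementation; the function name batch_norm, the eps value and the use of the biased variance are assumptions made for illustration.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's values, then scale by gamma and shift by beta."""
    # Step 1: normalize using the mean and (biased) variance of this batch.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # Step 2: scale and shift using the learnt parameters.
    return [gamma * n + beta for n in normalized]

out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

After step 1 the values have roughly zero mean and unit variance; after step 2 their mean is beta and their standard deviation is roughly gamma.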

During the training phase we want to calculate estimates for the mean and the variance using the input batch of data (hence the Batch in BatchNorm), and use these for normalization for that batch. We get this behavior on the forward pass when setting use_global_stats=False. We’re using local stats (i.e. just for that batch), and not ‘global stats’.

In the background we also update a running average for the mean and variance (called running_mean and running_var), but we don’t use it yet while use_global_stats=False. Mean and variance estimates from single batches change quite a lot between batches, so we have this running average to smooth things out and get a more reliable estimate for the dataset’s mean and variance; we call these running average estimates the ‘global stats’.

When we set use_global_stats=True during inference, we change the behavior of the forward pass. Instead of using the batch estimates of mean and variance (local stats), we now use the running estimates (global stats) that were calculated during the training phase.
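The switch between local and global stats can be sketched as follows. This is a simplified single-channel illustration, not MXNet's code: the class name SimpleBatchNorm is hypothetical, the momentum of 0.9 and the form of the running-average update are assumptions, and gamma and beta are left out (treated as 1 and 0).

```python
import math

class SimpleBatchNorm:
    """Hypothetical single-channel BatchNorm without gamma and beta."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.momentum = momentum  # assumed smoothing factor
        self.eps = eps
        self.running_mean = 0.0   # initial 'global stats'
        self.running_var = 1.0

    def forward(self, xs, use_global_stats=False):
        if use_global_stats:
            # Use the stored running estimates; they are not updated here.
            mean, var = self.running_mean, self.running_var
        else:
            # Use the stats of this batch and update the running averages.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]

bn = SimpleBatchNorm()
batch = [2.0, 4.0, 6.0, 8.0]
train_out = bn.forward(batch)                        # local stats, updates running averages
eval_out = bn.forward(batch, use_global_stats=True)  # global stats, no update
```

After a single training batch the running estimates have only moved a little from their initial values, which is why many batches are needed before the global stats become a reliable estimate.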

I don’t believe it’s common to set use_global_stats=True for training. Are there any other examples of this? By default this layer is initialized with a running mean of 0 and a running variance of 1, and these are not updated while training with use_global_stats=True, so the normalization step wouldn’t do any scaling. For the second step, beta is initialized as 0 and gamma as 1, but I would expect these to be updated during training, so some scaling behavior would still be learnt.


#3

Thank you for your reply.
In mx-rcnn, I can only find use_global_stats=True. Moreover, it is the same in mx-maskrcnn.


#4

In mx-maskrcnn, I notice that use_global_stats is set to True but fix_gamma is set to False, which means gamma and beta can still be learnt during training.
So we can use use_global_stats=True in training; the difference is that moving_mean and moving_var are used instead of data_mean and data_var, although this differs from the original BatchNorm.


#5

So use_global_stats=True in training will technically use the variables running_mean and running_var, but they won’t be updated, so they won’t actually be running averages. Unless you’re using pre-trained weights for these parameters, they will be set to 0 and 1 respectively (for each input channel), and so no scaling will occur in this stage.
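A small sketch of this frozen behavior, in plain Python for a single channel. The function name frozen_batch_norm, the eps value and the example numbers are assumptions for illustration only; this is not how any of the repos above implement it.

```python
import math

def frozen_batch_norm(xs, moving_mean=0.0, moving_var=1.0,
                      gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm with fixed stats: a per-channel affine transform."""
    scale = gamma / math.sqrt(moving_var + eps)
    return [scale * (x - moving_mean) + beta for x in xs]

xs = [10.0, 20.0, 30.0]
# Default stats (mean 0, variance 1): the input passes through almost unchanged.
untouched = frozen_batch_norm(xs)
# Hypothetical pre-trained stats: the input is actually normalized.
pretrained = frozen_batch_norm(xs, moving_mean=20.0, moving_var=100.0)
```

With the default stats only gamma and beta can change the output, whereas with stats loaded from a pre-trained model the layer normalizes with those fixed values.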