I think of BatchNorm as two steps: the first normalizes the input (i.e. zero-centered, with unit variance), and the second translates and scales the distribution as required using the learnt parameters beta and gamma respectively. To normalize the input in the first step we have to calculate the mean and the variance, so that we know what to subtract from the input values and what to divide them by.
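To make the two steps concrete, here is a minimal sketch in plain Python for a single feature (a list of scalars). The function name batch_norm_two_steps and the eps value are my own choices for illustration, not the library's implementation.

```python
import math

def batch_norm_two_steps(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalars, then scale by gamma and shift by beta."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n  # biased (population) variance
    # Step 1: zero-center and bring to (approximately) unit variance.
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # Step 2: scale by gamma and translate by beta.
    return [gamma * v + beta for v in x_hat]

out = batch_norm_two_steps([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
```

After both steps the output has a mean close to beta and a standard deviation close to gamma, which is the sense in which beta translates and gamma scales.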
During the training phase we want to calculate estimates for the mean and the variance from the input batch of data (hence the Batch in BatchNorm), and use these to normalize that batch. We get this behavior on the forward pass when setting use_global_stats=False: we’re using local stats (i.e. just for that batch), and not ‘global stats’.
In the background we also update a running average for the mean and variance (called running_mean and running_var), but we don’t use them yet while use_global_stats=False. Mean and variance estimates from single batches change quite a lot between batches, so we have this running average to smooth things out and get a more reliable estimate for the dataset’s mean and variance; we call these running average estimates the ‘global stats’.
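As a rough sketch of how a running average smooths out noisy batch estimates, assume an exponential moving average with a momentum of 0.9. The helper update_running and the batch means below are made up for illustration; the exact update rule and default momentum are assumptions and depend on the framework.

```python
def update_running(running, batch_stat, momentum=0.9):
    """Blend the previous running estimate with the newest batch estimate."""
    return momentum * running + (1.0 - momentum) * batch_stat

running_mean = 0.0  # initial value, before any batch has been seen
# Made-up batch means that swing between 4.0 and 6.0 (true mean is 5.0).
batch_means = [4.0 if i % 2 == 0 else 6.0 for i in range(200)]
for m in batch_means:
    running_mean = update_running(running_mean, m)
```

Each batch mean is off by 1.0 from the true value, while the running estimate settles much closer to 5.0.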
When we set use_global_stats=True during inference, we change the behavior of the forward pass. Instead of using the batch estimates of mean and variance (local stats), we now use the running estimates (global stats) that were calculated during the training phase.
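The switch between local and global stats can be sketched as follows. This is a simplified single-feature version that leaves out gamma and beta; batch_norm_forward and its argument layout are hypothetical and only mirror the behavior described above.

```python
import math

def batch_norm_forward(xs, running_mean, running_var, use_global_stats, eps=1e-5):
    """Normalize xs with either batch (local) or running (global) statistics."""
    if use_global_stats:
        # Inference behavior: reuse the estimates accumulated during training.
        mean, var = running_mean, running_var
    else:
        # Training behavior: estimate the statistics from this batch only.
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

batch = [10.0, 12.0]
local_out = batch_norm_forward(batch, 5.0, 4.0, use_global_stats=False)
global_out = batch_norm_forward(batch, 5.0, 4.0, use_global_stats=True)
```

With local stats the batch is centered on its own mean of 11, so the outputs are roughly -1 and 1. With global stats the same inputs are measured against the running mean of 5 and running variance of 4, giving roughly 2.5 and 3.5.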
I don’t believe it’s common to set use_global_stats=True for training. Are there any other examples of this? Currently this layer would be initialized with a running mean of 0 and a running variance of 1, so the first step wouldn’t do any initial scaling, and these are not updated while training with use_global_stats=True. For the second step, beta will be initialized as 0 and gamma as 1, but I would expect these to be updated during training so that some scaling behavior would be learnt.
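A quick way to see why the freshly initialized layer does nothing when it only ever uses global stats: with a running mean of 0, a running variance of 1, gamma of 1 and beta of 0, the output is the input up to the small eps term. The function batch_norm_global below is a hypothetical single-feature sketch, not the library code.

```python
import math

def batch_norm_global(xs, running_mean=0.0, running_var=1.0,
                      gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm forward pass that always uses the running (global) stats."""
    return [gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta
            for x in xs]

xs = [-3.0, 0.5, 7.0]
# At initialization: effectively an identity mapping.
out_init = batch_norm_global(xs)
max_diff = max(abs(o - x) for o, x in zip(out_init, xs))
# If training later moves gamma and beta, only the second step changes.
out_learnt = batch_norm_global(xs, gamma=2.0, beta=1.0)
```

So with frozen running stats, any scaling the layer ends up doing has to come from the learnt gamma and beta.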