Hi @weleen,

I think of BatchNorm as two steps: the first normalizes the input (i.e. zero mean and unit variance), and the second shifts and scales that distribution as required using the learnt parameters beta and gamma respectively. To normalize the input in the first step we have to calculate the mean and the variance, so we know what to subtract from the input values and what to divide them by.
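To make the two steps concrete, here is a minimal pure-Python sketch for a single feature. The function name `batch_norm_1d` and the `eps` default are my own choices for illustration; this is not MXNet's implementation.

```python
import math

def batch_norm_1d(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a list of floats, then scale by gamma and shift by beta."""
    n = len(values)
    # Step 1: estimate mean and (biased) variance from the batch and normalize.
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    normalized = [(v - mean) / math.sqrt(var + eps) for v in values]
    # Step 2: scale and shift with the learnt parameters.
    return [gamma * x + beta for x in normalized]

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm_1d(batch, gamma=2.0, beta=0.5)
out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
```

With these values the output ends up centred on beta with a standard deviation close to gamma (slightly below it because of `eps`).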

During the training phase we want to calculate estimates for the mean and the variance using the input batch of data (hence the Batch in BatchNorm), and use these for normalization for that batch. We get this behavior on the forward pass when setting `use_global_stats=False`. We’re using local stats (i.e. just for that batch), and not ‘global stats’.

In the background we also update a running average for the mean and variance (called `running_mean` and `running_var`), but we don’t use it yet while `use_global_stats=False`. Mean and variance estimates from single batches change quite a lot between batches, so we have this running average to smooth things out and get a more reliable estimate for the dataset’s mean and variance; we call these running average estimates the ‘global stats’.
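As a rough sketch of that smoothing, the running estimate can be updated as an exponential moving average. The helper `update_running`, the `momentum=0.9` value and the example batch means below are assumptions for illustration; check the BatchNorm documentation for the exact convention your version uses.

```python
def update_running(running, batch_stat, momentum=0.9):
    """Blend the previous running estimate with the newest batch statistic."""
    return momentum * running + (1.0 - momentum) * batch_stat

# Hypothetical per-batch means that jump around a true mean of about 5.
batch_means = [4.8, 5.3, 4.9, 5.2, 5.0]

running_mean = 0.0  # assumed initial value, matching the 0 initialization discussed here
history = []
for m in batch_means:
    running_mean = update_running(running_mean, m)
    history.append(running_mean)
```

Because the estimate starts from its initial value, it takes a number of batches before it gets close to the dataset mean, but each step moves much less than the raw batch means do.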

When we set `use_global_stats=True` during inference, we change the behavior of the forward pass. Instead of using the batch estimates of mean and variance (local stats), we now use the running estimates (global stats) that were calculated during the training phase.
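The switch between local and global stats can be sketched with a toy single-feature layer. `TinyBatchNorm` is a hypothetical class written for this post, not the MXNet layer, and it assumes the running stats are updated on every forward pass that uses local stats.

```python
import math

class TinyBatchNorm:
    """Toy single-feature BatchNorm; not the MXNet implementation."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def forward(self, values, use_global_stats):
        if use_global_stats:
            # Global stats: reuse the running estimates, leave them untouched.
            mean, var = self.running_mean, self.running_var
        else:
            # Local stats: estimate from this batch, update running averages.
            n = len(values)
            mean = sum(values) / n
            var = sum((v - mean) ** 2 for v in values) / n
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in values]

bn = TinyBatchNorm()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], use_global_stats=False)
mean_after_training = bn.running_mean
infer_out = bn.forward([1.0, 2.0, 3.0, 4.0], use_global_stats=True)
```

The same input gives different outputs in the two modes: `train_out` is centred on zero, while `infer_out` depends on whatever the running estimates have reached so far.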

I don’t believe it’s common to set `use_global_stats=True` for training. Are there any other examples of this? Currently this layer would be initialized with `running_mean` as 0 and `running_var` as 1, so the first step wouldn’t do any initial shifting or scaling, and these are **not updated** while training. For the second step, beta will be initialized as 0 and gamma as 1, but I would expect these to be updated during training so that some scaling behavior would be learnt.
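A quick way to see why nothing happens initially: plugging the initial values (mean 0, variance 1, gamma 1, beta 0) into the normalization leaves the input almost unchanged. The function name and `eps` value are assumptions for this sketch.

```python
import math

def normalize_with_global_stats(values, running_mean=0.0, running_var=1.0,
                                gamma=1.0, beta=0.0, eps=1e-5):
    """Apply BatchNorm using fixed running stats instead of batch stats."""
    scale = gamma / math.sqrt(running_var + eps)
    return [scale * (v - running_mean) + beta for v in values]

batch = [10.0, 12.0, 14.0]
out = normalize_with_global_stats(batch)
max_diff = max(abs(o - b) for o, b in zip(out, batch))
```

Here `out` matches `batch` up to the tiny effect of `eps`, so until gamma and beta move away from their initial values the layer is close to an identity mapping.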