Performance issue of BatchNorm with use_global_stats=True

When finetuning an existing model, I want to freeze all the base parameters, so I set use_global_stats=True. However, training runs at about half the speed of use_global_stats=False (the default). Any suggestions for improving this?

Based on the C++ implementation, it doesn't make sense for use_global_stats to slow anything down; if anything, it requires less computation, not more, since the per-batch mean and variance no longer need to be computed. Can you provide a minimal example that reproduces the slowdown?
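To illustrate why the global-stats path should be cheaper, here is a minimal NumPy sketch of the two forward paths (this is an illustration of the standard BatchNorm math, not MXNet's actual kernel): the training-mode path must reduce over the batch to get mean and variance, while the use_global_stats path just reuses stored running statistics.

```python
import numpy as np

def bn_batch_stats(x, gamma, beta, eps=1e-5):
    # Training-mode path: compute mean/variance from the current batch
    # (two extra reductions over the batch dimension).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def bn_global_stats(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # use_global_stats=True path: reuse stored running statistics,
    # skipping the per-batch reductions entirely.
    return gamma * (x - running_mean) / np.sqrt(running_var + eps) + beta

x = np.random.randn(32, 64).astype(np.float32)
gamma, beta = np.ones(64, np.float32), np.zeros(64, np.float32)
run_mean, run_var = np.zeros(64, np.float32), np.ones(64, np.float32)

y_train = bn_batch_stats(x, gamma, beta)
y_global = bn_global_stats(x, gamma, beta, run_mean, run_var)
print(y_train.shape, y_global.shape)
```

With running mean 0 and running variance 1 as above, the global-stats path is essentially an identity transform, which makes the point concrete: it does strictly less work than the batch-stats path.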