I recently implemented custom normalization layers for image classification and object detection tasks in MXNet, namely GroupNorm and Filter Response Normalization. (I know a GN implementation already exists in GluonCV.) These normalization blocks consist of only a few simple operations, but they slow down my training by roughly a factor of 2 compared to standard BatchNorm.
Why do these custom blocks take so much longer to compute? Is there any way to accelerate them apart from implementing them in C++?
Here is my code for the FRN (https://arxiv.org/abs/1911.09737):

    def hybrid_forward(self, F, x, gamma, beta, tau, eps):
        # mean squared norm of x over the spatial dimensions
        nu2 = F.mean(F.square(x), axis=[2, 3], keepdims=True)
        # filter response normalization: x / sqrt(nu2 + eps)
        # (rsqrt, since the paper divides by the root, not multiplies)
        x = F.broadcast_mul(x, F.rsqrt(F.broadcast_add(nu2, F.abs(eps))))
        # affine transformation and thresholded linear unit (TLU)
        x = F.broadcast_maximum(
            F.broadcast_add(F.broadcast_mul(gamma, x), beta), tau)
        return x
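For reference, the same FRN + TLU forward pass can be sanity-checked outside MXNet with a short NumPy sketch. The shapes and parameter values below are illustrative assumptions (channel-wise `(1, C, 1, 1)` parameters, as in the paper), not taken from my actual model:

    import numpy as np

    def frn_tlu(x, gamma, beta, tau, eps=1e-6):
        """FRN + TLU on an NCHW tensor, per the paper's formulas."""
        # nu2: mean squared norm over the spatial dims H, W
        nu2 = np.mean(np.square(x), axis=(2, 3), keepdims=True)
        # normalize: x / sqrt(nu2 + |eps|)
        x_hat = x / np.sqrt(nu2 + np.abs(eps))
        # affine transform followed by the thresholded linear unit
        return np.maximum(gamma * x_hat + beta, tau)

    x = np.random.randn(2, 3, 4, 4).astype(np.float32)
    gamma = np.ones((1, 3, 1, 1), dtype=np.float32)
    beta = np.zeros((1, 3, 1, 1), dtype=np.float32)
    tau = np.zeros((1, 3, 1, 1), dtype=np.float32)
    y = frn_tlu(x, gamma, beta, tau)

With `tau = 0`, the output behaves like a ReLU applied after the normalization, so every value of `y` is non-negative.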