Hi,

I recently implemented custom normalization layers for image classification and object detection in MXNet, namely GroupNorm (GN) and Filter Response Normalization (FRN). (I know a GN implementation already exists in GluonCV.) These normalization blocks consist of only a few simple operations, yet they slow down my training by a factor of 2 compared to the standard BatchNorm (BN).

Why do these custom blocks take so much longer to compute? Is there any way to speed them up apart from implementing them in C++?

Here is my code for FRN (https://arxiv.org/abs/1911.09737):

```
def hybrid_forward(self, F, x, gamma, beta, tau, eps):
    # mean squared norm of x over the spatial axes (H, W) of an NCHW input
    nu2 = F.mean(F.square(x), axis=[2, 3], keepdims=True)
    # filter response normalization: x / sqrt(nu2 + |eps|)
    x = F.broadcast_mul(x, F.rsqrt(F.broadcast_add(nu2, F.abs(eps))))
    # affine transformation and thresholded linear unit (TLU)
    x = F.broadcast_maximum(F.broadcast_add(F.broadcast_mul(gamma, x), beta), tau)
    return x
```
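
For reference, here is a minimal NumPy sketch of what FRN followed by TLU should compute, which can be used to check the output of the Gluon block on a small input. The helper name `frn_reference` is made up for illustration, and the parameter shapes `(1, C, 1, 1)` and the default `eps=1e-6` are assumptions, not taken from my actual layer:

```python
import numpy as np

def frn_reference(x, gamma, beta, tau, eps=1e-6):
    """Illustrative NumPy reference for FRN + TLU on an NCHW array (not an MXNet API)."""
    # mean squared norm over the spatial axes (H, W)
    nu2 = np.mean(np.square(x), axis=(2, 3), keepdims=True)
    # normalize: x / sqrt(nu2 + |eps|)
    x_hat = x / np.sqrt(nu2 + np.abs(eps))
    # affine transformation followed by the thresholded linear unit
    return np.maximum(gamma * x_hat + beta, tau)

# Small random NCHW input; shapes below are assumed for this sketch.
x = np.random.RandomState(0).randn(2, 3, 4, 4).astype("float32")
gamma = np.ones((1, 3, 1, 1), dtype="float32")
beta = np.zeros((1, 3, 1, 1), dtype="float32")
# tau = -inf disables the threshold, so only the normalization is checked
tau = np.full((1, 3, 1, 1), -np.inf, dtype="float32")
y = frn_reference(x, gamma, beta, tau)
```

With `gamma = 1`, `beta = 0` and the threshold disabled, the mean of `y**2` over each spatial map should be very close to 1, which is a quick sanity check for the normalization step.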