Custom Normalization Layers very slow

Hi,
I recently implemented custom normalization layers for image classification and object detection tasks in MXNet, namely GroupNorm and Filter Response Normalization (I know a GN implementation already exists in GluonCV). These normalization blocks consist of only a few simple operations, yet they slow down my training by a factor of ~2x compared to the standard BN.
Why do these custom blocks take so much longer to compute? Is there any way to accelerate them other than implementing them in C++?
Here is my code for FRN (https://arxiv.org/abs/1911.09737). It computes nu^2 as the mean of the squared activations over the spatial dimensions, normalizes x by 1/sqrt(nu^2 + eps), and then applies a per-channel affine transform followed by the thresholded linear unit max(gamma * x_hat + beta, tau):

    def hybrid_forward(self, F, x, gamma, beta, tau, eps):

        # nu^2: mean of the squared activations over the spatial dims (H, W)
        nu2 = F.mean(F.square(x), axis=[2, 3], keepdims=True)

        # filter response normalization: x / sqrt(nu^2 + eps)
        x = F.broadcast_mul(x, F.rsqrt(F.broadcast_add(nu2, F.abs(eps))))

        # per-channel affine transformation and thresholded linear unit (TLU)
        x = F.broadcast_maximum(F.broadcast_add(F.broadcast_mul(gamma, x), beta), tau)

        return x
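
For reference, the surrounding HybridBlock is set up roughly like this (a minimal sketch; the NCHW parameter shapes, init values, and the `num_channels` argument here are illustrative rather than my exact code):

    import mxnet as mx
    from mxnet.gluon import HybridBlock

    class FRN(HybridBlock):
        """Filter Response Normalization + TLU (sketch, NCHW input)."""
        def __init__(self, num_channels, eps_init=1e-6, **kwargs):
            super(FRN, self).__init__(**kwargs)
            with self.name_scope():
                # per-channel affine parameters, broadcast over (N, C, H, W)
                self.gamma = self.params.get('gamma', shape=(1, num_channels, 1, 1),
                                             init=mx.init.One())
                self.beta = self.params.get('beta', shape=(1, num_channels, 1, 1),
                                            init=mx.init.Zero())
                self.tau = self.params.get('tau', shape=(1, num_channels, 1, 1),
                                           init=mx.init.Zero())
                # learnable eps, kept positive via F.abs in hybrid_forward
                self.eps = self.params.get('eps', shape=(1, 1, 1, 1),
                                           init=mx.init.Constant(eps_init))

        def hybrid_forward(self, F, x, gamma, beta, tau, eps):
            nu2 = F.mean(F.square(x), axis=[2, 3], keepdims=True)
            x = F.broadcast_mul(x, F.rsqrt(F.broadcast_add(nu2, F.abs(eps))))
            return F.broadcast_maximum(
                F.broadcast_add(F.broadcast_mul(gamma, x), beta), tau)

    # usage sketch: hybridize so the F.* ops run as one cached symbolic graph
    layer = FRN(num_channels=64)
    layer.initialize()
    layer.hybridize()
    y = layer(mx.nd.random.uniform(shape=(8, 64, 32, 32)))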