MXNet 8 times slower than NumPy in a simple example

Description

Running a simple example, MXNet is about 8 times slower than my NumPy code.

To Reproduce

I am providing a zip file with the necessary files to reproduce the problem.

Steps to reproduce

  1. Uncompress the zip file.
  2. Use Jupyter with Python 3 to open both example files.
  3. Compare the processing times of the two examples.
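
For a fair comparison, both examples should be timed the same way. The snippet below is a minimal sketch of such a timing helper, not the code from the zip file: `time_call` and `workload` are hypothetical names used only for illustration. MXNet executes operations asynchronously, so the clock should only be stopped after pending work has finished; passing `mx.nd.waitall` as the `sync` argument is one way to do that.

```python
import time


def time_call(fn, sync=None, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            # For MXNet, pass mx.nd.waitall here so that asynchronous
            # operations are complete before the clock is stopped.
            sync()
        best = min(best, time.perf_counter() - start)
    return best


def workload():
    # Hypothetical stand-in; replace with the training loop of each notebook.
    return sum(i * i for i in range(100_000))


elapsed = time_call(workload)
print(f"best of 3 runs: {elapsed:.4f} s")
```

Taking the best of several runs reduces the influence of one-off costs such as warm-up or caching on the reported number.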

What I have tried to solve it

  1. Running on my GPU
  2. Running on my CPU
  3. Running the network both hybridized and non-hybridized (calling both net.hybridize and loss_function.hybridize)

Environment

System: Ubuntu 18.04.4 LTS 64-bit
MXNet version: mxnet-cu102mkl 1.6.0
Python version: Python 3.6.9
CPU: AMD Ryzen 9 3950x
GPU: GeForce RTX 2080 SUPER
RAM: 64 GiB

Processing times

With NumPy: 1.9 seconds (CPU only)
With MXNet: 15.7 seconds (on CPU, without hybridize)
With MXNet: 14.3 seconds (on GPU, without hybridize)
With MXNet: 6.6 seconds (on CPU, fully hybridized)
With MXNet: 6.1 seconds (on GPU, fully hybridized)
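
To put the timings above in relative terms, the short sketch below computes the slowdown of each MXNet run against the NumPy baseline. It only uses the numbers reported here; the variable names are illustrative.

```python
# Timings in seconds, as reported above.
numpy_seconds = 1.9
mxnet_seconds = {
    "CPU, not hybridized": 15.7,
    "GPU, not hybridized": 14.3,
    "CPU, fully hybridized": 6.6,
    "GPU, fully hybridized": 6.1,
}

# Slowdown factor of each MXNet configuration relative to NumPy.
slowdowns = {name: t / numpy_seconds for name, t in mxnet_seconds.items()}

for name, factor in slowdowns.items():
    print(f"{name}: {factor:.1f}x slower than NumPy")
```

This gives roughly 8.3x and 7.5x without hybridization, and 3.5x and 3.2x when fully hybridized, so the "8 times" figure refers to the non-hybridized CPU run.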

Note

I hope this is the right place to report this kind of problem and get some help. Looking forward to your responses.

Zip file

I have uploaded the necessary files to this GitHub post: