I’m doing some quick tests to compare the speed of `NDArray` vs `numpy`, and of GPU vs CPU. Everything runs in the mxnet_36 kernel of a SageMaker p3.8xl notebook.

My test data is the following:

```
import numpy as np
import mxnet as mx
from mxnet import nd

# Numpy
a = np.random.rand(10**3, 10**5)
b = np.random.rand(10**5, 10**4)
c = np.random.rand(10**3, 10**4)
# Numpy to NDArray
A = mx.nd.array(a)
B = mx.nd.array(b)
C = mx.nd.array(c)
```

This runs in 5.7s:

`y = np.tanh(np.dot(a, b) + c)`
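For reference, here is roughly how such a measurement can be sketched with a simple wall clock (I use `time.perf_counter` here as an assumption; the notebook numbers above may come from `%%time`, and the shapes below are shrunk so the snippet runs quickly):

```python
import time
import numpy as np

# Smaller shapes than in the post, so this finishes fast;
# the original uses (10**3, 10**5) x (10**5, 10**4).
a = np.random.rand(10**3, 10**3)
b = np.random.rand(10**3, 10**2)
c = np.random.rand(10**3, 10**2)

start = time.perf_counter()
y = np.tanh(np.dot(a, b) + c)  # numpy executes eagerly, so this is the full cost
elapsed = time.perf_counter() - start
print(f"numpy: {elapsed:.4f}s, result shape {y.shape}")
```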

This runs in 2.6s:

```
Y = nd.tanh(nd.dot(A, B) + C)
Y.wait_to_read()
```
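The `wait_to_read()` call matters here because MXNet dispatches operations asynchronously: without a barrier, the timer only measures how long it takes to *enqueue* the work. A minimal sketch of the same timing with an explicit guard (assuming MXNet is available; smaller shapes than above):

```python
import time

Y = None
try:
    import mxnet as mx
    from mxnet import nd

    A = nd.random.uniform(shape=(1000, 1000))
    B = nd.random.uniform(shape=(1000, 100))
    C = nd.random.uniform(shape=(1000, 100))

    start = time.perf_counter()
    Y = nd.tanh(nd.dot(A, B) + C)
    # Block until the result is actually computed; otherwise the measured
    # time reflects only the asynchronous dispatch, not the computation.
    Y.wait_to_read()
    print(f"NDArray: {time.perf_counter() - start:.4f}s")
except ImportError:
    # MXNet not installed in this environment; nothing to time.
    pass
```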

The copy to GPU takes…**39s**! Intuitively that sounds a bit long, no?

```
A_gpu = A.as_in_context(mx.gpu(0))
B_gpu = B.as_in_context(mx.gpu(0))
C_gpu = C.as_in_context(mx.gpu(0))
```

The matrix multiplication + addition on GPU takes…**18s**!

```
Y_gpu = nd.tanh(nd.dot(A_gpu, B_gpu) + C_gpu)
Y_gpu.wait_to_read()
```
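To separate any one-time setup cost from steady-state throughput, a common benchmarking pattern is to do one unmeasured warm-up pass before timing. A sketch of that pattern (assumes MXNet and at least one GPU; guarded so it degrades gracefully otherwise):

```python
import time

Y = None
try:
    import mxnet as mx
    from mxnet import nd
    have_gpu = mx.context.num_gpus() > 0
except ImportError:
    have_gpu = False

if have_gpu:
    ctx = mx.gpu(0)
    A = nd.random.uniform(shape=(1000, 1000), ctx=ctx)
    B = nd.random.uniform(shape=(1000, 100), ctx=ctx)
    C = nd.random.uniform(shape=(1000, 100), ctx=ctx)

    # Warm-up pass: absorbs one-time costs so they don't pollute the timing.
    nd.tanh(nd.dot(A, B) + C).wait_to_read()

    start = time.perf_counter()
    Y = nd.tanh(nd.dot(A, B) + C)
    Y.wait_to_read()
    print(f"warm GPU run: {time.perf_counter() - start:.4f}s")
```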

**The GPU is 7 times slower than the CPU on a matrix multiply, which is supposedly the strength of the V100… When I re-run the code, subsequent iterations take around 150ms. What is wrong with the first run? Why is there this “cold start”?**