On my RTX 2080 Ti, dot products are no faster with FP16 than with FP32 (and the FP16 version is about 4 times slower than the equivalent PyTorch code). Is there some flag or environment variable that I'm missing?

```
import mxnet as mx
import numpy as np
import time

n = 2**14
ctx = mx.gpu(0)
dtype = np.float16  # switch to np.float32 to compare

# Allocate all three matrices on the GPU
with ctx:
    a = mx.nd.zeros((n, n), dtype=dtype)
    b = mx.nd.zeros((n, n), dtype=dtype)
    c = mx.nd.zeros((n, n), dtype=dtype)

tic = time.time()
for _ in range(100):
    mx.nd.dot(a, b, out=c)
# "use" the result; asscalar() blocks until the queued GPU work has finished
res = float(c[0, 0].asscalar())
print(time.time() - tic)
```

(This outputs about 60 seconds for either `dtype`.)
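To put that timing in perspective, here is a rough conversion to throughput. It is only a sketch: it assumes the usual count of 2·n³ floating-point operations for a dense n×n matrix product, and `achieved_tflops` is a helper name made up for this post, not part of MXNet.

```python
def achieved_tflops(n, iters, seconds):
    """Rough throughput estimate for `iters` dense n x n matrix products.

    Assumes 2 * n**3 floating-point operations per product
    (one multiply and one add per inner-product term).
    """
    total_flops = 2 * n**3 * iters
    return total_flops / seconds / 1e12


# Numbers from the benchmark above: n = 2**14, 100 products, about 60 s
print(round(achieved_tflops(2**14, 100, 60.0), 2))  # prints 14.66
```

So both dtypes come out at roughly 14.7 TFLOP/s under that assumption.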