I have been seeing very slow dot products in MXNet, and summing an array also seems very slow (about 10x slower than NumPy). Am I doing something wrong? Is there a way to get these operations to run at least as fast as NumPy?

import mxnet as mx
import numpy as np

v = np.random.randn(1000)
v_mx_cpu = mx.ndarray.array(v)
v_mx_gpu = mx.ndarray.random.normal(shape=(1000,), ctx=mx.gpu(0))

In [11]: %timeit mx.ndarray.sum(v_mx_gpu)
34.3 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit mx.ndarray.sum(v_mx_cpu)
40.9 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit np.sum(v)
4.25 µs ± 15.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
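
In case it helps to reproduce this outside IPython, here is a minimal sketch of the same comparison as a standalone script. The helper names `time_call` and `benchmark` and the array sizes are my own choices, not anything from MXNet. MXNet runs operations asynchronously, so the sketch calls `wait_to_read()` on the result so that the timing covers the computation itself and not only the time to queue it. The GPU case is left out so the script runs on a CPU-only machine, and the MXNet part is skipped if the import fails.

```python
import timeit

import numpy as np

try:
    import mxnet as mx
except Exception:  # lets the NumPy part run when MXNet is missing or fails to import
    mx = None


def time_call(fn, number=1000, repeat=5):
    """Return the best per-call time in microseconds (hypothetical helper)."""
    best = min(timeit.repeat(fn, number=number, repeat=repeat))
    return best / number * 1e6


def benchmark(size=1000):
    """Time summing `size` elements with NumPy and, if available, MXNet on CPU."""
    v = np.random.randn(size)
    results = {"numpy": time_call(lambda: np.sum(v))}
    if mx is not None:
        v_mx = mx.nd.array(v)
        # wait_to_read() blocks until the asynchronous result has been computed
        results["mxnet_cpu"] = time_call(
            lambda: mx.nd.sum(v_mx).wait_to_read()
        )
    return results


if __name__ == "__main__":
    # Compare a small array (as in the question) with a much larger one
    for size in (1000, 1_000_000):
        print(size, benchmark(size))
```

Running it with a larger size as well should show whether the gap is a fixed per-call overhead or grows with the amount of data.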