Mxnet.nd.sum and dot ~10x slower than numpy?


#1

I was observing very slow dot products in mxnet, and it seems like summing an array is also very slow (~10x slower than numpy). Am I doing something wrong? Is there a way to get these operations to run at least at the speed of numpy?

import mxnet as mx
import numpy as np

v = np.random.randn(1000)
v_mx_cpu = mx.ndarray.array(v)  # float32 copy on the CPU
v_mx_gpu = mx.ndarray.random.normal(shape=(1000,), ctx=mx.gpu(0))  # lives on GPU 0

In [11]: %timeit mx.ndarray.sum(v_mx_gpu)
34.3 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit mx.ndarray.sum(v_mx_cpu)
40.9 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [13]: %timeit np.sum(v)
4.25 µs ± 15.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@eric-haibin-lin :slight_smile:


#2

I assume that some specific operators are simply better optimized in numpy than in MXNet on CPU. As for the GPU, the launch and synchronization overhead can dominate for such a small amount of data. You are still going to benefit from the GPU when training a real neural network, especially with lots of data, even though particular operators on small arrays may run slower.


#3

Thanks @Sergey

As a sanity check, I ran the tests with larger arrays, and we can see the improvement growing with the size! So this is good :slight_smile:

Now my question is: is there a way to get the same performance / acceleration for smaller arrays as well?

N = 1,000,000

In [3]: %timeit mx.ndarray.sum(v_mx_gpu)
34.3 µs ± 498 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [4]: %timeit mx.ndarray.sum(v_mx_cpu)
33.2 µs ± 438 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [5]: %timeit np.sum(v)
480 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

N = 100,000,000

In [8]: %timeit mx.ndarray.sum(v_mx_gpu)
33.2 µs ± 82.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [9]: %timeit mx.ndarray.sum(v_mx_cpu)
32.8 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [10]: %timeit np.sum(v)
86.8 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#4

To be honest, I don’t think there is a way you can do it.

Maybe you can apply a heuristic like: "if you know the array is 'small', convert it to numpy, do the operation, and convert back to ndarray", but I suspect that solution would cause more problems than it solves…