I’m doing some quick tests to compare the speed of NumPy vs MXNet NDArray, and of GPU vs CPU. Everything is run in the mxnet_36 kernel of a SageMaker p3.8xl notebook.
My test data is the following:
```python
import numpy as np
import mxnet as mx
from mxnet import nd

# NumPy
a = np.random.rand(10**3, 10**5)
b = np.random.rand(10**5, 10**4)
c = np.random.rand(10**3, 10**4)

# NumPy to NDArray
A = mx.nd.array(a)
B = mx.nd.array(b)
C = mx.nd.array(c)
```
This runs in 5.7 s:

```python
y = np.tanh(np.dot(a, b) + c)
```
This runs in 2.6 s:

```python
Y = nd.tanh(nd.dot(A, B) + C)
Y.wait_to_read()
```
The copy to GPU takes… 39 s! That intuitively sounds a bit long, no?

```python
A_gpu = A.as_in_context(mx.gpu(0))
B_gpu = B.as_in_context(mx.gpu(0))
C_gpu = C.as_in_context(mx.gpu(0))
```
The matrix multiplication + addition on GPU takes… 18 s!

```python
Y_gpu = nd.tanh(nd.dot(A_gpu, B_gpu) + C_gpu)
Y_gpu.wait_to_read()
```
That makes the GPU 7 times slower than the CPU on a matrix multiply, which is supposedly the strength of the V100… When I re-run the same computation, subsequent iterations take around 150 ms. What is wrong with the first run? Why is there this “cold start”?
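For what it’s worth, here is a small warmup-then-time harness I could use to separate one-time costs (for the GPU case, presumably things like CUDA context initialization and memory-pool allocation on the first call) from steady-state speed. This is just a sketch with a made-up helper name (`time_op`), demonstrated on the NumPy version with smaller matrices; for the NDArray versions the timed function must end with `Y.wait_to_read()` or `mx.nd.waitall()`, since MXNet executes asynchronously and the call would otherwise return before the computation finishes:

```python
import time
import numpy as np

def time_op(fn, warmup=1, iters=5):
    """Run fn `warmup` times untimed, then return the mean
    wall-clock time per call over `iters` timed runs, in seconds."""
    for _ in range(warmup):
        fn()  # absorbs one-time setup costs so they don't pollute the timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Same structure as the benchmark above, smaller sizes for a quick check.
a = np.random.rand(10**2, 10**3)
b = np.random.rand(10**3, 10**2)
c = np.random.rand(10**2, 10**2)

mean_s = time_op(lambda: np.tanh(np.dot(a, b) + c))
print(f"mean time per iteration: {mean_s * 1000:.2f} ms")
```

Comparing the first un-warmed call against the mean of the warmed calls should make the size of the cold-start penalty explicit.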