Is it possible to speed up the FullyConnected calculation for sparse input?


#1

Hi,
I'm training a neural network model with MXNet. The input is a sparse one-hot vector: it has a dimension of 1 million, but only tens of its entries are 1. The input is fully connected to a hidden layer with 200 nodes, and I found training to be very slow. Is there any way to speed up the FullyConnected calculation, given that the input is so sparse?


#2

You can encode the data in CSR format and replace FullyConnected with sparse.dot and broadcast_add. See the linear classification example here: https://github.com/apache/incubator-mxnet/tree/master/example/sparse
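
For concreteness, here is a minimal sketch of what that replacement could look like (the shapes, density, and variable names are illustrative, chosen to match the question):

import mxnet as mx
import scipy.sparse as spsp

# illustrative setup: a batch of 200 rows over a 1M-dimensional input,
# ~10 non-zeros per row, projected down to 200 hidden units
x = mx.nd.sparse.csr_matrix(spsp.rand(200, 1000000, format='csr', density=0.00001).astype('float32'))
w = mx.nd.random.normal(shape=(1000000, 200))  # (input_dim, num_hidden) - not transposed, unlike FullyConnected's weight
b = mx.nd.zeros((1, 200))

# sparse.dot exploits the CSR structure; broadcast_add applies the bias row-wise
out = mx.nd.broadcast_add(mx.nd.sparse.dot(x, w), b)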


#3

@eric-haibin-lin I tried encoding the data in CSR format and replacing ndarray.FullyConnected with nd.sparse.dot, but it actually became slower. My code is as follows:
original: time cost 0:00:00.365667
# out1 = mx.nd.FullyConnected(features, self.w1.data(ctx), self.b1.data(ctx), num_hidden=self.num_hidden)
# act1 = mx.nd.Activation(out1, act_type='relu')

new: time cost 0:00:00.495941
out1 = mx.nd.sparse.dot(features, self.w1.data(ctx))
act1 = mx.nd.broadcast_add(out1, self.b1.data(ctx))

where w1 is the weight matrix and b1 is the bias. The input, features, is a 200 x 1000000 matrix with about 2000 non-zero values, which I encoded in CSR format.


#4

Hi @tppppppppp
Did you call act1.wait_to_read() to make sure the operation has actually completed?
https://mxnet.incubator.apache.org/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization
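
For anyone timing operators later in this thread: because execution is asynchronous, a call like nd.dot returns before the computation has run, so a timer around it measures almost nothing. A small illustration (shapes are arbitrary):

import time
import mxnet as mx

x = mx.nd.ones((1000, 1000))

start = time.time()
y = mx.nd.dot(x, x)          # returns immediately; the work is only queued
print(time.time() - start)   # near zero - not the real compute time

y.wait_to_read()             # blocks until the result is actually available
print(time.time() - start)   # now includes the actual computation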


#5

@eric-haibin-lin I called act1.asnumpy() instead.


#6

I am a bit confused - what is the shape of w1 you are using? FullyConnected computes dot(feature, w1.T), which is different from dot(feature, w1). Are you getting consistent results here?

The following code works for me:

import mxnet as mx
import scipy.sparse as spsp
import time

# 200 x 1,000,000 CSR input with ~2,000 non-zeros, matching the setup above
csr = spsp.rand(200, 1000000, format='csr', density=0.00001)
x_sparse = mx.nd.sparse.csr_matrix(csr)
w = mx.nd.ones((1000000, 100))

# time sparse.dot
mx.nd.waitall()
a = time.time()
y = mx.nd.sparse.dot(x_sparse, w)
y.wait_to_read()
b = time.time()
print(b - a)
# 0.00143098831177

# time FullyConnected on the densified input (note the transposed weight)
w_t = w.T
x_dense = x_sparse.tostype('default')
mx.nd.waitall()
c = time.time()
y2 = mx.nd.FullyConnected(x_dense, w_t, no_bias=True, num_hidden=100)
y2.wait_to_read()
d = time.time()
print(d - c)
# 0.451608896255
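
As a quick sanity check that the two paths agree numerically (continuing from the snippet above, where y comes from sparse.dot and y2 from FullyConnected), something like this should print a value near zero:

print(mx.nd.abs(y - y2).max().asscalar())  # expected to be ~0, up to float rounding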

#7

But have you tried it on a GPU?


#8

import mxnet as mx
import scipy.sparse as spsp
import time

# same setup as above, but cast to float32 and moved to GPU
csr = spsp.rand(200, 1000000, format='csr', density=0.00001).astype('float32')
x_sparse = mx.nd.sparse.csr_matrix(csr).as_in_context(mx.gpu())
w = mx.nd.ones((1000000, 100)).as_in_context(mx.gpu())

# time sparse.dot on GPU
mx.nd.waitall()
a = time.time()
y = mx.nd.sparse.dot(x_sparse, w)
y.wait_to_read()
b = time.time()
print(b - a)
# 0.3979964256286621

# time FullyConnected on the densified input
w_t = w.T
x_dense = x_sparse.tostype('default')
mx.nd.waitall()
c = time.time()
y2 = mx.nd.FullyConnected(x_dense, w_t, no_bias=True, num_hidden=100)
y2.wait_to_read()
d = time.time()
print(d - c)
# 0.0007915496826171875

Correct me if anything went wrong here, thanks!