Is it possible to speed up the FullyConnected calculation for sparse input?

Hi,
I’m training a neural network model with MXNet. The input is a sparse multi-hot vector: it has a dimension of 1 million, but only a few tens of entries are 1. The input is fully connected to a hidden layer with 200 nodes, and I found training is very slow. Is there any way to speed up the FullyConnected calculation, since the input is quite sparse?

You can encode the data in CSR format and replace FC with sparse.dot and broadcast_add. You can see the linear classification example here: https://github.com/apache/incubator-mxnet/tree/master/example/sparse
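
For example, a minimal sketch of that replacement (shapes are illustrative only: a batch of 200 inputs of dimension 1,000,000 and a 200-unit hidden layer; the weight is stored as (input_dim, num_hidden) so sparse.dot needs no transpose):

import mxnet as mx
import scipy.sparse as spsp

csr = spsp.rand(200, 1000000, format='csr', density=0.00001).astype('float32')
x = mx.nd.sparse.csr_matrix(csr)               # CSR-encoded input batch
w = mx.nd.random.normal(shape=(1000000, 200))  # weight stored as (input_dim, num_hidden)
b = mx.nd.zeros((1, 200))                      # bias

out = mx.nd.sparse.dot(x, w)                   # replaces FullyConnected
out = mx.nd.broadcast_add(out, b)              # add the bias
act = mx.nd.Activation(out, act_type='relu')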

@eric-haibin-lin I tried encoding the data in CSR format and replacing ndarray.FullyConnected with nd.sparse.dot, but I found it was actually slower. My code is as follows:
original: time cost 0:00:00.365667
#out1 = mx.nd.FullyConnected(features, self.w1.data(ctx), self.b1.data(ctx), num_hidden=self.num_hidden)
#act1 = mx.nd.Activation(out1, act_type='relu')

new: time cost 0:00:00.495941
out1 = mx.nd.sparse.dot(features, self.w1.data(ctx))
act1 = mx.nd.broadcast_add(out1, self.b1.data(ctx))

where w1 is the weight matrix and b1 is the bias. features is the input, a 200 x 1000000 matrix with about 2000 non-zero values, which I have encoded in CSR format.

Hi @tppppppppp
Did you call act1.wait_to_read() to make sure the operation has actually completed?
https://mxnet.incubator.apache.org/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization
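
For reference, a minimal illustration of the asynchronous execution that makes the synchronization call necessary when timing (the shapes here are arbitrary):

import mxnet as mx

x = mx.nd.ones((1000, 1000))
y = mx.nd.dot(x, x)   # returns immediately; the work is only queued on the engine
y.wait_to_read()      # blocks until the result has actually been computed
# y.asnumpy() also synchronizes, since it must copy the data out first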

@eric-haibin-lin I called act1.asnumpy() instead.

I am a bit confused - what is the shape of w1 you are using? FullyConnected computes dot(feature, w1.T), which is different from dot(feature, w1). Are you getting consistent results here?

The following code works for me:

import mxnet as mx
import scipy.sparse as spsp
csr = spsp.rand(200, 1000000, format='csr', density=0.00001)
x_sparse = mx.nd.sparse.csr_matrix(csr)
w = mx.nd.ones((1000000, 100))
import time
mx.nd.waitall()
a = time.time()
y = mx.nd.sparse.dot(x_sparse, w)
y.wait_to_read()
b = time.time()
print(b - a)
# 0.00143098831177

w_t = w.T
x_dense = x_sparse.tostype('default')
mx.nd.waitall()
c = time.time()
y2 = mx.nd.FullyConnected(x_dense, w_t, no_bias=True, num_hidden=100)
y2.wait_to_read()
d = time.time()
print(d - c)
# 0.451608896255

But have you tried it on GPU?

import mxnet as mx
import scipy.sparse as spsp

csr = spsp.rand(200, 1000000, format='csr', density=0.00001).astype('float32')
x_sparse = mx.nd.sparse.csr_matrix(csr).as_in_context(mx.gpu())
w = mx.nd.ones((1000000, 100)).as_in_context(mx.gpu())
import time
mx.nd.waitall()
a = time.time()
y = mx.nd.sparse.dot(x_sparse, w)
y.wait_to_read()
b = time.time()
print(b - a)
# 0.3979964256286621

w_t = w.T
x_dense = x_sparse.tostype('default')
mx.nd.waitall()
c = time.time()
y2 = mx.nd.FullyConnected(x_dense, w_t, no_bias=True, num_hidden=100)
y2.wait_to_read()
d = time.time()
print(d - c)
# 0.0007915496826171875

Correct me if anything here went wrong, thanks!

Hi, do you have any ideas for speeding up the dot product between a sparse and a dense matrix on GPU?

This topic is quite old, but I'll try to give you my advice.

Turn the dense tensor into a sparse one using the indices of the sparse tensor, then do a sparse-vs-sparse dot product. It might accelerate your execution when the sparsity ratio is high.

It has lower time complexity than the full dot product as a thought experiment; I'm not sure it would work well in practice.
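
To make the idea concrete, here is a minimal sketch (the helper rowwise_sparse_dot is hypothetical, not an MXNet API). It does not literally convert the dense matrix to a sparse one, but it illustrates the same point: only the weight rows touched by the non-zero indices need to participate in the product.

import mxnet as mx
import scipy.sparse as spsp

# Hypothetical sketch: x is a CSRNDArray of shape (batch, in_dim),
# w is a dense NDArray of shape (in_dim, hidden).
def rowwise_sparse_dot(x, w):
    indptr = x.indptr.asnumpy()                  # row boundaries of the CSR storage
    rows = mx.nd.take(w, x.indices)              # gather only the weight rows that are touched
    rows = mx.nd.broadcast_mul(rows, x.data.reshape((-1, 1)))  # scale by the stored values
    out = mx.nd.zeros((x.shape[0], w.shape[1]))
    for i in range(x.shape[0]):                  # accumulate the contributions of each input row
        s, e = int(indptr[i]), int(indptr[i + 1])
        if e > s:
            out[i, :] = rows[s:e].sum(axis=0)
    return out

csr = spsp.rand(200, 1000000, format='csr', density=0.00001).astype('float32')
x = mx.nd.sparse.csr_matrix(csr)
w = mx.nd.ones((1000000, 100))
print(mx.nd.abs(rowwise_sparse_dot(x, w) - mx.nd.sparse.dot(x, w)).sum().asscalar())  # ~0

Whether something along these lines beats a tuned dense GEMM on GPU will depend on the sparsity ratio, as noted above.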

Thanks for your reply! I'll give it a try.