@eric-haibin-lin I tried encoding the data in CSR format and replacing mx.nd.FullyConnected with mx.nd.sparse.dot, but it turned out to be even slower. My code is as follows:
original: time cost 0:00:00.365667
#out1 = mx.nd.FullyConnected(features, self.w1.data(ctx), self.b1.data(ctx), num_hidden=self.num_hidden)
#act1 = mx.nd.Activation(out1, act_type='relu')
new: time cost 0:00:00.495941
out1 = mx.nd.sparse.dot(features, self.w1.data(ctx))
act1 = mx.nd.broadcast_add(out1, self.b1.data(ctx))
where w1 is the weight matrix and b1 is the bias matrix. The input, features, is a 200 × 1000000 matrix with about 2000 non-zero values, which I have encoded in CSR format.
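
To make clear what I expect the sparse path to compute, here is a minimal pure-Python sketch of CSR storage and a CSR-times-dense product followed by a broadcast bias add. It is not MXNet code: the helper names (dense_to_csr, csr_dot_dense_add_bias), the toy sizes, and the (input_dim, num_hidden) weight layout are all assumptions made for illustration.

```python
# Pure-Python sketch of CSR storage and a CSR x dense product with a bias add.
# Illustrative only: helper names, toy sizes, and weight layout are assumptions.

def dense_to_csr(rows):
    """Convert a list of dense rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # stored non-zero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # end offset of this row's non-zeros
    return data, indices, indptr


def csr_dot_dense_add_bias(csr, weight, bias):
    """Compute features @ weight + bias, visiting only the stored non-zeros."""
    data, indices, indptr = csr
    num_hidden = len(bias)
    out = []
    for r in range(len(indptr) - 1):
        acc = list(bias)  # the bias is broadcast across every row
        for k in range(indptr[r], indptr[r + 1]):
            w_row = weight[indices[k]]
            for j in range(num_hidden):
                acc[j] += data[k] * w_row[j]
        out.append(acc)
    return out


# Toy stand-in for the real 200 x 1000000 input.
features = [[0, 2, 0, 0],
            [0, 0, 0, 0],
            [1, 0, 0, 3]]
weight = [[1, 0], [0, 1], [1, 1], [2, 2]]  # assumed shape: (input_dim, num_hidden)
bias = [10, 20]

csr = dense_to_csr(features)
result = csr_dot_dense_add_bias(csr, weight, bias)
```

The inner loops touch only the stored non-zeros, so for my input the work should scale with the roughly 2000 non-zero entries rather than with the 200 × 1000000 dense shape. That is why I expected the sparse version to be faster, not slower.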