I wanted to train a neural network with sparse input, so I built the network in the following way:
```python
import mxnet as mx

data = mx.symbol.Variable('data', stype='csr')
# hidden layer
fc1_weight = mx.symbol.Variable("fc1_weight", shape=(100000, 200), stype='row_sparse')
net = mx.symbol.dot(data, fc1_weight)
fc1_bias = mx.symbol.Variable(name='fc1_bias', shape=(200,))
net = mx.symbol.broadcast_add(lhs=net, rhs=fc1_bias)
net = mx.symbol.Activation(data=net, name='ac1', act_type='sigmoid')
# output layer
net = mx.sym.FullyConnected(data=net, name='fcout', num_hidden=120000)
net = mx.sym.Activation(data=net, name='acout', act_type='sigmoid')
label = mx.sym.Variable('softmax_label')
net = mx.sym.Custom(data=net, label=label, name='ce', op_type='CrossEntropyLoss')
```
The training code just prepares each mini-batch's input and label, then calls `update`. When the Adam optimizer is initialized with `lazy_update` set to `True`, each mini-batch is about 40% faster than dense training, i.e., training with the same input/label data but with the network built without the sparse storage types.
However, if `lazy_update` is set to `False` with no other code changes, training becomes much slower, about 3~4 times slower than dense.
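For reference, here is a minimal sketch of the training setup described above; the batch size, learning rate, and the `batches` iterable are placeholders rather than my actual values:

```python
import mxnet as mx

batch_size = 32  # placeholder
mod = mx.mod.Module(symbol=net, data_names=['data'],
                    label_names=['softmax_label'], context=mx.gpu(0))
mod.bind(data_shapes=[mx.io.DataDesc('data', (batch_size, 100000))],
         label_shapes=[mx.io.DataDesc('softmax_label', (batch_size, 120000))])
mod.init_params()
# lazy_update=True applies Adam only to the rows of the row_sparse weight
# whose gradients are non-zero; False forces a full dense update
mod.init_optimizer(optimizer='adam',
                   optimizer_params={'learning_rate': 0.001,
                                     'lazy_update': True})

for data_batch, label_batch in batches:  # mini-batches prepared elsewhere
    mod.forward(mx.io.DataBatch(data=[data_batch], label=[label_batch]),
                is_train=True)
    mod.backward()
    mod.update()
```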
I did some profiling. There are three `adam_update` operations per mini-batch; the first one should be the update of the weight matrix between the input layer and the hidden layer. A GPU-to-GPU sync copy happens right after that `adam_update` begins. In dense training this sync copy lasts about 15 ms, but for sparse training with `lazy_update` disabled it takes almost 600 ms to complete.
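The timings came from profiler traces. For anyone who wants to reproduce them, a minimal capture of a single mini-batch with the built-in `mx.profiler` API could look like this (the filename and the `run_one_minibatch` helper are placeholders):

```python
import mxnet as mx

# capture one mini-batch in a trace viewable in chrome://tracing
mx.profiler.set_config(profile_all=True, filename='sparse_adam_profile.json')
mx.profiler.set_state('run')

run_one_minibatch()  # hypothetical helper: one forward/backward/update
mx.nd.waitall()      # wait for all pending GPU work before stopping

mx.profiler.set_state('stop')
mx.profiler.dump()
```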
All the training ran on a single GPU, with no data or model parallelism. The input batch for sparse training was prepared as an `mx.nd.sparse.csr_matrix`, while for dense training it was just an `mx.nd.array`. The label batch is always an `mx.nd.array`.
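For illustration, the batch preparation looks roughly like this, with random data standing in for the real features:

```python
import numpy as np
import mxnet as mx

batch_size, num_features, num_labels = 32, 100000, 120000  # placeholders

dense_np = np.random.binomial(1, 0.001,
                              (batch_size, num_features)).astype('float32')
csr_batch = mx.nd.sparse.csr_matrix(dense_np, ctx=mx.gpu(0))  # sparse run
dense_batch = mx.nd.array(dense_np, ctx=mx.gpu(0))            # dense baseline
label_batch = mx.nd.array(
    np.random.binomial(1, 0.01, (batch_size, num_labels)).astype('float32'),
    ctx=mx.gpu(0))
```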
Could someone please take a look? Much appreciated.