Are zero gradients or an additional forward pass suitable in MXNet training?


I am trying to develop a detection network using MXNet.

Previously, I built a subnetwork using Caffe. This subnetwork requires data splitting and an additional forward operation during training.

In my MXNet configuration, the subnetwork is:

conv_part0 | conv_part1 | conv_part2 [these share weights]

roi_pooled0 | roi_pooled1 | roi_pooled2 [ these features are extracted from conv_partx ]

        | feature_selector |

[this selector chooses one roi_pooled feature per ROI; during back-propagation, it sets the gradients of the roi_pooled features that were not selected to 0]
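To be concrete about what the selector does, here is the intended forward/backward behaviour as a NumPy sketch (not the actual layer; the shapes and `select_idx` values are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: 3 branches, 4 ROIs, feature dim 5.
n_branch, n_roi, feat_dim = 3, 4, 5
pooled = [np.random.randn(n_roi, feat_dim) for _ in range(n_branch)]

# One branch index chosen per ROI (made up here; in the real layer it
# comes from the first forward pass).
select_idx = np.array([0, 2, 1, 2])

# Forward: for each ROI, pick the selected branch's feature.
out = np.stack([pooled[b][r] for r, b in enumerate(select_idx)])

# Backward: route the incoming gradient only to the selected branch;
# every unselected branch receives exactly zero for that ROI.
out_grad = np.ones_like(out)
in_grads = [np.zeros_like(p) for p in pooled]
for r, b in enumerate(select_idx):
    in_grads[b][r] = out_grad[r]
```

So per ROI, exactly one of the three `in_grads` rows is nonzero, and that is the gradient that flows back through the corresponding roi_pooled and conv_part symbols.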

Training then needs an additional forward operation to get the labels:

1. mod.forward using data_batch_1
2. data_batch_2 is updated using the output of that forward
3. mod.forward_backward using data_batch_2
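For reference, this is roughly how I run the two passes with the Module API (a sketch; `make_batch2` stands for my own label-updating code and is not an MXNet function):

```python
# mod is an already-bound mx.mod.Module with initialized
# parameters and optimizer.
def train_step(mod, data_batch_1, make_batch2):
    # Pass 1: forward only, to produce the selector labels.
    # is_train=True so the network behaves as in training,
    # but no backward is run on this pass.
    mod.forward(data_batch_1, is_train=True)
    outputs = [o.asnumpy() for o in mod.get_outputs()]

    # Build the second batch from the first pass's outputs
    # (make_batch2 is my own code, not part of MXNet).
    data_batch_2 = make_batch2(data_batch_1, outputs)

    # Pass 2: forward + backward + parameter update.
    mod.forward_backward(data_batch_2)
    mod.update()
```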

However, this configuration does not seem to back-propagate correctly.

When I set the shared conv weights to a reversed identity matrix, or feed wrong labels to the feature selector, the loss and accuracy stay exactly the same, as if those changes had no effect.

I suspect two possibilities. First, in the feature selector, the gradients of the unselected features are zero and are propagated back to the convolution symbols through ROI pooling; maybe these zeros interfere with proper back-propagation. Second, the additional forward operation may interfere with the module binding.
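About the first possibility: mathematically, zeroing one branch's gradient should not stop the shared weight from learning, because the shared weight accumulates the gradients of all branches and the masked branch simply contributes zero. A tiny NumPy check of that (made-up linear branches, nothing MXNet-specific):

```python
import numpy as np

w = np.random.randn(3, 3)            # shared weight of the conv parts
x0 = np.array([1.0, 2.0, 3.0])       # input to branch 0
x1 = np.array([4.0, 5.0, 6.0])       # input to branch 1

# Two branches share w; pretend the selector picked branch 0.
y0, y1 = w @ x0, w @ x1
g0 = np.ones(3)                      # gradient flowing into branch 0
g1 = np.zeros(3)                     # branch 1 masked to zero

# Gradient of the shared weight = sum over branches.
grad_w = np.outer(g0, x0) + np.outer(g1, x1)
```

Here `grad_w` equals the branch-0 contribution alone, so if the selector's backward really assigns the zeros correctly, the zeros by themselves should be harmless.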

I am trying to track down the problem, but it is not easy.

Are an additional forward pass, and back-propagation where some gradients are zeroed, supported in MXNet?

Any ideas on how to solve this problem?