The example in #5580 helped me a lot in starting to understand the data flow. Thanks a lot @pengwangucla @saicoco.
Now I want to implement three custom loss functions that not only take an additional parameter (specifically a hyperparameter, not a learned one) but are also independent of the label (the training is unsupervised, and from the new layer's perspective the loss depends only on a binary transformation of the preceding layer).
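To make the setup concrete, here is a rough plain-Python sketch of the kind of layer I mean. It does not use MXNet at all and only mimics the forward()/backward() split. The class name `BinarizationLoss`, the hyperparameter name `alpha`, the 0.5 threshold, and the squared-distance loss are all placeholders made up for illustration, not my actual loss functions.

```python
class BinarizationLoss:
    """Plain-Python stand-in (NOT an MXNet operator) for a label-free loss."""

    def __init__(self, alpha=1.0):
        # alpha is a fixed hyperparameter: set once, never updated by training
        self.alpha = alpha

    @staticmethod
    def binarize(x):
        # hypothetical binary transformation: threshold the activation at 0.5
        return 1.0 if x > 0.5 else 0.0

    def forward(self, batch):
        # batch: list of rows, one row of activations per sample; no label
        total = 0.0
        for row in batch:
            total += sum((x - self.binarize(x)) ** 2 for x in row)
        return self.alpha * total / len(batch)

    def backward(self, batch):
        # gradient w.r.t. each activation, treating the binary code as a
        # constant (the threshold itself contributes no gradient)
        n = len(batch)
        return [[2.0 * self.alpha * (x - self.binarize(x)) / n for x in row]
                for row in batch]


loss = BinarizationLoss(alpha=0.5)
batch = [[0.9, 0.2], [0.6, 0.4]]  # 2 samples, 2 features each
value = loss.forward(batch)       # scalar loss for the whole batch
grads = loss.backward(batch)      # same shape as batch
```

The point of the sketch is only that the loss needs the preceding layer's output and one fixed number, nothing else.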
With respect to this, I have 3 issues:

Am I right to assume that we can use either mx.operator.CustomOp or mx.operator.NDArrayOp?

If I compare MXNet's implementation of the forward()/backward() pass to Caffe's:
template <typename Dtype> void CustomLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
then
def forward(self, is_train, req, in_data, out_data, aux):
contains the data of one batch (each row corresponds to one sample) and not just one sample, correct?
I would like to train the network in Python and do the inference in C++ later. What is the way to do this?
using
c_predict_api.h,
or
mxnet-cpp/MxNetCpp.h
as in https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/feature_extract/feature_extract.cpp
In other words, will these loss layers be registered for C++ as well, or do we strictly have to rely on NNVM?
I still don’t quite get the new_op example on mxnet.incubator: if we look at the mathematical definition of the softmax function and its layer, it is a function of the learned weight parameters w_{i} and the inputs x_{i}, so why are there no weights in the forward() pass?
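To be explicit about the definition I have in mind (this is the usual softmax-regression form as I understand it; the K classes and one weight vector w_{k} per class are my notation, not taken from the example):

$$
p_k = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}, \qquad k = 1, \dots, K
$$

Written this way the output depends on both x and the w_{k}, which is why I expected forward() to receive weights as well.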