When using the SoftmaxOutput layer with the parameter multi_output set to False, the input data and output data are reshaped into 2-D tensors:

Tensor<xpu, 2, DType> data = in_data[softmaxfocalout_enum::kData].get_with_shape<xpu, 2, DType>(s2, s);
Tensor<xpu, 2, DType> out = out_data[softmaxfocalout_enum::kOut].get_with_shape<xpu, 2, DType>(s2, s);

Then Softmax(out, data) is applied.
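To make the setup concrete, here is a minimal sketch of what I understand this path to compute: the buffer is viewed as a row-major (rows x cols) matrix and softmax is taken independently over each row. RowSoftmax is a hypothetical stand-in written with std::vector, not the actual mshadow implementation, and the row-major layout is my assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Softmax(out, data) on a 2-D view.
// `data` is a contiguous buffer interpreted as rows x cols, row-major.
std::vector<float> RowSoftmax(const std::vector<float>& data,
                              std::size_t rows, std::size_t cols) {
  assert(cols > 0);
  assert(rows * cols == data.size());
  std::vector<float> out(data.size());
  for (std::size_t r = 0; r < rows; ++r) {
    const float* in = data.data() + r * cols;
    float* o = out.data() + r * cols;
    // Subtract the row maximum so exp() cannot overflow.
    const float row_max = *std::max_element(in, in + cols);
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      o[c] = std::exp(in[c] - row_max);
      sum += o[c];
    }
    // Normalize so every row sums to 1.
    for (std::size_t c = 0; c < cols; ++c) {
      o[c] /= sum;
    }
  }
  return out;
}
```

With a BCHW input, rows would be B and cols would be C*H*W, so each sample gets a single distribution over all of its elements.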

My questions are:

- Why is a softmax function used here rather than a sigmoid function, which usually outputs a single value?
- Why is the label's shape Tensor<xpu, 1, DType> rather than Tensor<xpu, 2, DType>, the same shape as the input data?
- I would like to use it for multi-label image classification. Normally the image data is laid out as BCHW, so the 2-D tensor would have shape B x (C*H*W). This may cause a speed problem: in the backward pass, the GridDim may be too small because the batch size is small.
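For reference, this is the contrast I have in mind between the two setups, written as a rough sketch rather than the real operator code. SoftmaxGrad and SigmoidGrad are hypothetical helper names; I am assuming the usual cross-entropy gradients (probability minus one-hot label for softmax, probability minus 0/1 target for sigmoid) and using int class indices for simplicity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax + cross-entropy: the label is 1-D, one class index per sample.
// `prob` is the (batch x classes) softmax output, row-major.
std::vector<float> SoftmaxGrad(const std::vector<float>& prob,
                               const std::vector<int>& label,
                               std::size_t classes) {
  assert(prob.size() == label.size() * classes);
  std::vector<float> grad(prob);  // start from the probabilities
  for (std::size_t b = 0; b < label.size(); ++b) {
    assert(label[b] >= 0 && static_cast<std::size_t>(label[b]) < classes);
    // Subtract 1 at the true class: grad = prob - one_hot(label).
    grad[b * classes + static_cast<std::size_t>(label[b])] -= 1.0f;
  }
  return grad;
}

// Per-element sigmoid + binary cross-entropy: the target is 2-D and has
// the same (batch x classes) shape as the scores, with entries 0 or 1.
std::vector<float> SigmoidGrad(const std::vector<float>& score,
                               const std::vector<float>& target) {
  assert(score.size() == target.size());
  std::vector<float> grad(score.size());
  for (std::size_t i = 0; i < score.size(); ++i) {
    const float p = 1.0f / (1.0f + std::exp(-score[i]));
    grad[i] = p - target[i];  // each output is treated independently
  }
  return grad;
}
```

In the first function a single index per sample is enough to describe the target, while the second needs one target per output element, which is the shape I would expect for multi-label data.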