SoftmaxCrossEntropyLoss vs KLDivLoss

Can anyone explain the difference between using SoftmaxCELoss and KLDivLoss with Gluon? They seem to be measuring the same thing. How do you decide which one to use?

thank you.


You’re right that SoftmaxCELoss and KLDivLoss can end up measuring the same thing from the perspective of optimization. The key difference is in how each one is derived.

Let’s take classification problems, where SoftmaxCELoss is mostly used. What we want is for the parameters of the network to maximize the probability of observing the true labels given the data that we have. In the binary classification case you can simply use a sigmoid function to transform the scores from your model into a probability, and then use log loss (negative log likelihood), also known as binary cross entropy (BCE), to measure how far your predicted probability is from the labels you observe. The extension of this to multiple classes is to use softmax for transforming your network outputs to probabilities and to use cross entropy as a generalization of BCE. Here’s a good blog that explains this.
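To make the binary vs. multi-class parallel concrete, here is a small numpy sketch (plain numpy rather than Gluon, just to show the math; the scores and labels are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Binary case: sigmoid + log loss (negative log likelihood / BCE)
score = 1.2          # raw model output
label = 1            # true binary label
p = sigmoid(score)
bce = -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Multi-class case: softmax + cross entropy
scores = np.array([0.5, -0.3, 2.0])  # un-normalized scores (logits)
true_label = 2
probs = softmax(scores)
ce = -np.log(probs[true_label])      # cross entropy with an integer label
```

In both cases the loss is just the negative log of the probability the model assigns to the observed label.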

One crucial detail is that for a single data point, only the predicted probability assigned to the true label contributes to the softmax cross entropy loss. This means that if I have 3 different classes in my data, and for a single data point my true label is 2 and my predicted probabilities are [0.1, 0.1, 0.8], then only the value of 0.8, which corresponds to label 2, affects the cross-entropy loss for that data point. This is because when we represent the true label 2 as a distribution, it’s represented as the one-hot vector [0, 0, 1], which zeroes out every other term in the cross-entropy sum. This is a feature of multi-class classification problems which allows us to use cross entropy in a very specific way for classification.
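You can verify this directly with the numbers above (a numpy sketch, using the same made-up predictions):

```python
import numpy as np

pred = np.array([0.1, 0.1, 0.8])    # predicted probabilities
target = np.array([0.0, 0.0, 1.0])  # true label 2 as a one-hot distribution

# Full cross-entropy sum: -sum(target_i * log(pred_i))
ce = -np.sum(target * np.log(pred))

# Only the entry where the target is 1 survives the sum,
# so the loss collapses to -log(0.8)
```

The 0.1 entries never enter the loss; changing them (while keeping 0.8 fixed) would not change the cross entropy for this data point.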

KLDivLoss, on the other hand, measures the difference between two probability distributions. It is typically used when the outputs of the network are the parameters of a distribution, and you want to measure the distance between the distribution your network parametrizes and a true distribution from your data.
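The underlying quantity is the KL divergence, KL(p || q) = Σᵢ pᵢ log(pᵢ / qᵢ). A minimal numpy sketch with two made-up distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution from the data
q = np.array([0.5, 0.3, 0.2])  # distribution the network parametrizes

# KL(p || q) = sum_i p_i * log(p_i / q_i)
# Non-negative, and zero only when p and q are identical
kl = np.sum(p * np.log(p / q))
```

Note that unlike the classification case, every entry of both distributions contributes to this loss, not just one "true label" entry.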

Here is a key difference between the two in Gluon. By default, when you pass your labels and predictions into SoftmaxCELoss, it expects the labels to be the categorical indicator, i.e. 2, and the predictions to be the un-normalized scores from your network before softmax. With KLDivLoss, by default it expects your labels to be a discrete probability distribution, i.e. [0.8, 0.1, 0.1], and your predictions to be in the form of a log probability distribution.
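A numpy sketch of the two input conventions (this mimics the shapes Gluon expects by default, but does not call Gluon itself; the scores are made up). With a one-hot target the two computations agree, since the target distribution’s entropy term is zero:

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    return z - np.log(np.exp(z).sum())

scores = np.array([0.2, -1.0, 1.5])  # un-normalized network outputs
label = 2                            # categorical indicator

# SoftmaxCELoss-style inputs: integer label + raw scores
ce = -log_softmax(scores)[label]

# KLDivLoss-style inputs: one-hot target distribution + log probabilities
target = np.array([0.0, 0.0, 1.0])
log_q = log_softmax(scores)
kl = np.sum(target * (np.log(target + 1e-12) - log_q))  # KL(target || q)
```

So for hard one-hot labels the optimization objective coincides; the practical choice comes down to which input format matches your labels, and KLDivLoss is the natural fit when your targets are genuinely soft distributions.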
