# How is the softmax normalized and how does the accuracy work?

#1

I am working on a VQA project and have basically 2 questions now.

First of all, let me introduce the dataset: every training question has 3 answers, so I feed each sample into the model as `(question, ans1), (question, ans2), (question, ans3)`. If I use the softmax to predict, I get only one answer per question at the end, so the accuracy could be at most `0.33`.

Besides, I use `loss = gluon.loss.SoftmaxCrossEntropyLoss()` as the training loss and `mx.metric.Accuracy()` for evaluation, updated with `metric.update([label], [output])`, where `label` is the training answer and `output` is the softmax vector over all possible answers.

The training loop uses

``````
cross_entropy = loss(output, label)
cross_entropy.backward()
``````

Here is something really strange: I tested with just 3 samples and got 73% accuracy after 10 epochs (when the accuracy should be at most `0.33` on my dataset). To investigate this issue, I ran the model on the training data itself, and it gives really strange answers.

Here is my training data

``````
what is in front of the chair,mirror,pool,shelf,
what is the color of the person's clothes in video,blue,dark blue,black blue,
what is the person doing in video,cleaning up,wiping mirror,washing cup,
where is the person in video,indoor,washroom,residence,
is the person sitting or standing in the video,standing,standing,standing
``````

And my prediction result is (each training question has 3 answers, and I just take the one with the maximum softmax value):

``````
what is in front of the chair,shelf,
what is the color of the person's clothes in video,cleaning up,
what is the person doing in video,washroom,
where is the person in video,kissing,
is the person sitting or standing in the video,light white
``````

I use `np.argmax` to get the answer from the softmax layer. I also print the softmax result; the first few lines of it are

``````
answer is shelf with softmax [15.491705] <NDArray 1 @cpu(0)>
answer is cleaning up with softmax [8.109538] <NDArray 1 @cpu(0)>
answer is washroom with softmax [8.194625] <NDArray 1 @cpu(0)>
answer is kissing with softmax [7.8190136] <NDArray 1 @cpu(0)>
answer is light white with softmax [6.411439] <NDArray 1 @cpu(0)>
``````

So my 2 questions: 1) Obviously the accuracy cannot really be 73%, so how does `metric.update()` evaluate the accuracy? 2) How can a softmax value be over 1 or negative, isn't it normalized? The official `Accuracy` documentation says: "Prediction values for samples. Each prediction value can either be the class index, or a vector of likelihoods for all classes." (https://mxnet.apache.org/api/python/metric/metric.html), and it just considers the class with the maximum likelihood. How can that work if the likelihoods are above 1?

I know it is a lot to deal with at once, so if anyone could answer the first question, maybe I can debug the code from there. Thank you!

#2
1. I probably would not use the Accuracy metric at all, since that metric should show how well your model works, and in your case all 3 answers are acceptable. By splitting the data per answer you effectively make the term "accuracy" meaningless. I recommend calculating accuracy separately on the original dataset (before the splitting) by just checking whether the predicted answer is in the list of acceptable answers. That doesn't change the way you calculate the loss function.
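A rough sketch of that per-question accuracy, in plain NumPy-free Python (the helper name and the sample data are made up for illustration): count a prediction as correct if it appears anywhere in the question's answer list.

```python
def multi_answer_accuracy(predicted, answer_lists):
    """predicted: one predicted answer string per question.
    answer_lists: the full list of acceptable answers per question."""
    hits = sum(1 for pred, answers in zip(predicted, answer_lists)
               if pred in answers)
    return hits / len(predicted)

preds = ["mirror", "black blue", "kissing"]
answers = [["mirror", "pool", "shelf"],
           ["blue", "dark blue", "black blue"],
           ["cleaning up", "wiping mirror", "washing cup"]]
print(multi_answer_accuracy(preds, answers))  # 2 of 3 correct -> 0.666...
```

This way a question counts as answered correctly no matter which of its 3 acceptable answers the model picks, so the 0.33 ceiling from the per-answer split disappears.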

2. What does your model output? Is it a softmax over your whole vocabulary? Softmax is always normalized so that it sums to 1. Check this out:

``````
import mxnet as mx

a = mx.nd.array([-1, 15, 0.4])
b = a.softmax()  # b is [1.12535112e-07, 9.99999404e-01, 4.56352183e-07]
c = b.sum()      # c is 1
``````

So, I am curious how exactly you get those softmax values. You don't treat the output of `SoftmaxCrossEntropyLoss` as a softmax, right?
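As for your first question: when you pass a vector of scores to it, `mx.metric.Accuracy` effectively takes the argmax of each prediction row and compares that index with the label; it never checks whether the rows are normalized, which is why values above 1 don't bother it. A minimal NumPy sketch of that behaviour (my simplification, not MXNet's actual source):

```python
import numpy as np

def accuracy(labels, outputs):
    # Take the index of the largest value in each row --
    # whether the rows sum to 1 is irrelevant to the result.
    pred_classes = np.argmax(outputs, axis=1)
    return float(np.mean(pred_classes == labels))

labels = np.array([2, 0])
outputs = np.array([[15.49, 8.10, 20.3],   # raw logits; values > 1 are fine
                    [0.1,   0.7,  0.2]])
print(accuracy(labels, outputs))  # row 0 correct, row 1 wrong -> 0.5
```

So if your network outputs raw logits, the metric still "works", it just measures argmax agreement with whatever labels you fed it.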

#3

I don't think I did what you describe. I just followed the official VQA demo, https://gluon.mxnet.io/chapter08_computer-vision/visual-question-answer.html, and used the final layer

``````
self.fc2 = nn.Dense(num_category)
``````

and the loss

``````
loss = gluon.loss.SoftmaxCrossEntropyLoss()
``````

The update of the model is

``````
with autograd.record():
    output = net(data)
    cross_entropy = loss(output, label)
cross_entropy.backward()
trainer.step(data[0].shape[0])
``````

Nothing more, and judging by the results of the official demo, this does not affect the outcome, even though the output has not been normalized by a softmax.

So, actually, we can use `SoftmaxCrossEntropyLoss` as the loss and have something other than a softmax layer as the output, am I right? (At the beginning I thought that if I used `SoftmaxCrossEntropyLoss`, the final layer would automatically be normalized to a softmax.)
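To make the distinction concrete, here is what the raw output of a `Dense` layer looks like versus its softmax, written in plain NumPy rather than MXNet (the logit values are made up to resemble the numbers printed above):

```python
import numpy as np

# A Dense layer returns raw, unnormalized scores (logits) -- they can be
# negative or far above 1. These values mimic such an output.
logits = np.array([15.49, 8.11, 8.19])

# Explicit softmax: shift by the max for numerical stability, then normalize.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(logits.sum())  # arbitrary, can be anything
print(probs.sum())   # 1.0 (up to float rounding) -- softmax always normalizes
```

The `[15.49...]` values you printed are logits of exactly this kind, which is why they can exceed 1.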

#4

Thanks for providing the reference.

Yes, `fc2` doesn't return a softmax. If you want softmax probabilities from the output, you should call `output.softmax()`.

While that is technically more correct, it won't change the prediction result: if you look into the VQA example, they use argmax to get the final results, `output = np.argmax(output.asnumpy(), axis=1)`, and the argmax of the softmaxed output is the same.

They don't apply softmax in the network itself precisely because they use `SoftmaxCrossEntropyLoss`. If you look into the documentation - https://mxnet.incubator.apache.org/api/python/gluon/loss.html?mxnet.gluon.loss.SoftmaxCrossEntropyLoss - it applies softmax to the predictions internally before calculating the final values.

1. Yes, we can use `SoftmaxCrossEntropyLoss`, but we shouldn’t apply `Softmax` when feeding the output to loss to avoid double softmaxing.

2. When we calculate the final output, we can apply softmax if we want to see a probability-like distribution over the results. If we don't care about probabilities and just want `nd.argmax` to get the most probable prediction, we can skip the softmax entirely, because `argmax(output)` produces the same result as `argmax(softmax(output))`.
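Since softmax is a strictly increasing transform of each row, the argmax cannot change; a quick check in NumPy (example logits made up):

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with the usual max-shift for numerical stability.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[15.49, 8.10, 8.19],
                   [-1.0, 15.0, 0.4]])

print(np.argmax(logits, axis=1))           # [0 1]
print(np.argmax(softmax(logits), axis=1))  # [0 1] -- same indices
```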

#5

This completely solved my questions, thank you very much! Answer accepted in post #2.

#6

Yet another question, in addition to these 2: why could the original accuracy be 73% (actually 99% at convergence)?

First let me explain: this was due to a mistake in the training labels, but I am still curious why it gives this result. As you see, each question has 3 answers, and these 3 answers may differ. Say there are 3 training samples forming a batch, and by mistake all the labels fed to the model are the same answer objects; below, the `ans1` in line 1 and line 2 are the same variable:

``````
question1, ans1, question1, ans2, question1, ans3
question2, ans1, question2, ans2, question2, ans3
``````

So the question batch is `question1, question1, question1` and the answer batch is `ans1, ans2, ans3`. During evaluation, it should compare `(my_ans1, my_ans2, my_ans3)` with `(ans1, ans2, ans3)`, and given the 99% accuracy, the model should output `(ans1, ans2, ans3)`, and it really does output these answers.

So, as you see, the training batch is `(question1, question1, question1)`: 3 identical samples. I expected 3 identical answers, maybe all `ans1`, all `ans2`, or all `ans3`, but it gives different answers (and they even match the labels, because the result correlates with the data index in the batch: the first sample in the batch gives `ans1`, the second gives `ans2`). So my final question is: does the training take the data index in the batch as a feature? Otherwise how could it achieve the result above?

#7

Training doesn’t take data index in batch as an attribute. The only information training receives is the information you explicitly pass in `data` when calling `output = net(data)`.
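One way to convince yourself of that: a network is a pure function of its input, so identical rows in a batch must produce identical outputs. A toy check with a fixed linear "layer" in NumPy (random weights standing in for your model, not MXNet code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # a fixed "layer" standing in for the net

x = rng.normal(size=4)           # one encoded question
batch = np.stack([x, x, x])      # three identical copies in one batch

out = batch @ W                  # forward pass

# Identical inputs give identical rows, hence identical argmax answers.
print(np.argmax(out, axis=1))
```

If your model produces different answers for the three copies, then the three inputs must have differed somewhere in the preprocessing.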

It is hard for me to explain why it happens. The only thing I can assume is that maybe some information about the index is passed to the network when the original question is transformed into an ndarray. Is there a pattern if you provide question 1, 2, … not in the training loop, but in evaluation? Like, maybe you always receive the first answer out of the 3 possible?