Softmax Regression

https://d2l.ai/chapter_linear-networks/softmax-regression.html

In the Log-Likelihood section and elsewhere, what does ‘n’ represent?

n represents the number of observations, i.e., the number of training examples.
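
For reference, the negative log-likelihood in that section is a sum with one term per observation, and n is the upper limit of that sum (if I am reading the section correctly):

    -\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})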

Can somebody explain how -\log P(y | x) = -\sum_j ( y_j * \log \hat{y}_j )?

Here is my humble understanding:

Note that \hat{y}_j = P(y = j | x), and that y is a one-hot vector (all entries zero except a single one at the true class). Thus

- \sum_j ( y_j * \log \hat{y}_j ) = - \log \hat{y}_y = - \log P(y | x)

Sorry for my poor English.
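
A quick numerical check of the same identity (my own sketch in NumPy; the probability values are made up for illustration):

    import numpy as np

    # Predicted class probabilities \hat{y} for one example (made-up values)
    y_hat = np.array([0.1, 0.7, 0.2])
    # One-hot label y: the true class is index 1
    y = np.array([0.0, 1.0, 0.0])

    # Cross-entropy: -sum_j y_j * log(y_hat_j)
    cross_entropy = -np.sum(y * np.log(y_hat))
    # Negative log-probability of the true class: -log y_hat[1]
    neg_log_p = -np.log(y_hat[1])

    print(cross_entropy, neg_log_p)  # both print ~0.3567, so the two expressions agree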

Can someone please explain this part of Question 4?

Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).
What is the problem if we try to design a binary code for it? Can we match the entropy lower bound on the number of bits?

What does it mean when we say entropy lower bound?

Hi, I was wondering: is there any standard answer to Questions 2 and 3? The questions are as follows:

Just under equation 3.4.4, the equation “\mathbf{o}^{(i)} = \mathbf{W} \mathbf{x}^{(i)} + \mathbf{b}, where \hat{\mathbf{y}}^{(i)} = \mathrm{softmax}(\mathbf{o}^{(i)})” is listed.

Should b also have a superscript (i.e. if I understand correctly, there is a separate bias for each output neuron)? This appears to be the case in equation 3.4.2.
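
Not an official answer, but my reading of the notation: the superscript (i) indexes training examples, while \mathbf{b} has one entry per output class, and that same vector is shared by every example, so it gets no example superscript. A small NumPy sketch of the shapes (the sizes are made up for illustration):

    import numpy as np

    d, q, n = 4, 3, 5          # input features, output classes, examples (made-up sizes)
    W = np.random.randn(q, d)  # one weight row per output class
    b = np.random.randn(q)     # one bias per output class, shared by every example
    X = np.random.randn(n, d)  # the n examples x^{(i)} stacked as rows

    O = X @ W.T + b            # broadcasting adds the same b to every row o^{(i)}

    # Row i of O equals W x^{(i)} + b, i.e. the per-example equation with a shared bias
    assert np.allclose(O[0], W @ X[0] + b)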

Hi @mlrocks, please check https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html#properties-of-entropy to see the lower bound’s meaning.
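
To put a concrete number on that lower bound: the entropy of (1/3, 1/3, 1/3) is log2(3) ≈ 1.585 bits, while a binary code must assign each class a whole number of bits, which is one way to see why the bound cannot be matched per symbol. A quick check (my own sketch in NumPy):

    import numpy as np

    p = np.array([1/3, 1/3, 1/3])
    entropy = -np.sum(p * np.log2(p))
    print(entropy)  # ~1.585 bits: the entropy lower bound per symbol

    # The best prefix code for three equally likely symbols uses lengths 1, 2 and 2 bits,
    # so its expected length is (1 + 2 + 2) / 3 ≈ 1.667 bits, strictly above the bound.
    expected_length = (1 + 2 + 2) / 3
    print(expected_length)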