 # Softmax Regression

In the topic Log-Likelihood and others, what does ‘n’ represent?

n represents the number of observations.

Can somebody explain how -\log p(y|x) = -\sum_j y_j \log(\hat{y}_j)?

Here is my humble understanding:

Note that \hat{y}_j = p(y = j \mid x), and that y is a one-hot vector (all zeros except a single one at the position of the true class). Thus

- \sum_j y_j \log(\hat{y}_j) = - \log(\hat{y}_y) = - \log p(y|x),

since every term with y_j = 0 vanishes and only the term for the true class survives.

(Sorry for my poor English.)
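The identity above can be checked numerically. A minimal sketch with made-up probabilities (the numbers are illustrative, not from the book):

```python
import numpy as np

# Cross-entropy with a one-hot label: -sum_j y_j * log(y_hat_j)
# reduces to -log(y_hat_y), the negative log-probability of the true class.
y_hat = np.array([0.1, 0.7, 0.2])  # predicted probabilities p(y == j | x)
y = np.array([0.0, 1.0, 0.0])      # one-hot label: the true class is j = 1

cross_entropy = -np.sum(y * np.log(y_hat))
neg_log_prob = -np.log(y_hat[1])   # -log p(y|x), picking out the true class

print(np.isclose(cross_entropy, neg_log_prob))
```

Every zero entry of y kills the corresponding log term, so only the true class contributes.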

Can someone please explain this part in Question 4?

Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).
What is the problem if we try to design a binary code for it? Can we match the entropy lower bound on the number of bits?


What does it mean when we say entropy lower bound?

Hi, I was wondering, is there a standard answer to questions 2 and 3? Questions 2 and 3 are as follows:

Just under equation 3.4.4, the equation “\mathbf{o}^{(i)} = \mathbf{W}\mathbf{x}^{(i)} + \mathbf{b} where \hat{\mathbf{y}}^{(i)}” is listed.

Should b also have a superscript (i.e. if I understand correctly, there is a separate bias for each output neuron)? This appears to be the case in equation 3.4.2.

Hi @mlrocks, please check https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html#properties-of-entropy to see the lower bound’s meaning.
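To make the lower bound concrete for the three-class question above, here is a small sketch (codeword lengths are my own illustrative choice) computing the entropy of three equally likely classes and comparing it to the best per-symbol binary code:

```python
import math

# Entropy of three equally likely classes, in bits:
# H = -sum_i p_i * log2(p_i) = log2(3) ≈ 1.585 bits.
p = [1 / 3, 1 / 3, 1 / 3]
entropy = -sum(pi * math.log2(pi) for pi in p)
print(round(entropy, 3))  # 1.585

# A binary code must assign each class a whole number of bits. The best
# prefix code for 3 symbols uses codeword lengths (1, 2, 2), giving an
# expected length of 5/3 ≈ 1.667 bits, which exceeds the entropy.
expected_length = (1 + 2 + 2) / 3
print(round(expected_length, 3))  # 1.667
```

Because log2(3) is not an integer, no per-symbol binary code can reach the entropy lower bound here; only coding blocks of symbols (or arithmetic coding) can approach it.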

In the equation 3.4.5: If XW results in an n by q matrix and b is a vector of size q, how does it get added to the matrix? In the preliminaries section, the book states that “column vectors to be the default orientation of vectors”. Should I just assume that in this case the vector is a row vector and it gets added to each row of the XW matrix?
I apologize if it’s a trivial question, but I want to make sure that I get this right.

Yes, your assumption is correct. We have \mathbf{X} \in \mathbb{R}^{n \times d}, \mathbf{W} \in \mathbb{R}^{d \times q}, and \mathbf{b} \in \mathbb{R}^{1 \times q}. When we compute \mathbf{X}\mathbf{W} + \mathbf{b} in numpy, broadcasting copies the row vector \mathbf{b} n times, effectively forming \mathbf{B} := [\mathbf{b}^\top, \ldots, \mathbf{b}^\top]^\top \in \mathbb{R}^{n \times q}, which is then added elementwise.
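A minimal sketch of this broadcasting behavior (the shapes n = 2, d = 4, q = 3 and the values are arbitrary, chosen only for illustration):

```python
import numpy as np

# n samples, d features, q outputs.
n, d, q = 2, 4, 3
X = np.ones((n, d))
W = np.ones((d, q))
b = np.array([[10.0, 20.0, 30.0]])  # row vector of shape (1, q)

O = X @ W + b  # numpy broadcasts b across all n rows of XW

# Equivalent to explicitly tiling b into an n-by-q matrix B first.
B = np.tile(b, (n, 1))
print(np.array_equal(O, X @ W + B))
```

So you never need to materialize the repeated matrix yourself; broadcasting handles the row-wise addition.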