https://d2l.ai/chapter_linear-networks/softmax-regression.html

In the topic Log-Likelihood and others, what does ‘n’ represent?

n represents the number of observations.

Can somebody explain how -\log p(y \mid x) = -\sum_j y_j \log \hat{y}_j?

Here is my humble understanding:

note that \hat{y}_j = p(y = j \mid x)

and y is a one-hot vector (all zeros except a single one),

thus

- \sum_j y_j \log \hat{y}_j = -\log \hat{y}_y = -\log p(y \mid x)

Sorry for my poor English.
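The derivation above can be checked numerically. This is a minimal sketch with a made-up predicted distribution `y_hat` over three classes and a one-hot label for class 1:

```python
import numpy as np

# Hypothetical predicted distribution over 3 classes (sums to 1).
y_hat = np.array([0.2, 0.7, 0.1])
# One-hot encoding of the true class (class index 1).
y = np.array([0.0, 1.0, 0.0])

# Cross-entropy sum: -sum_j y_j * log(y_hat_j)
cross_entropy = -np.sum(y * np.log(y_hat))

# Because y is one-hot, only the true-class term survives:
neg_log_likelihood = -np.log(y_hat[1])

print(np.isclose(cross_entropy, neg_log_likelihood))  # True: both equal -log 0.7
```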

Can someone please explain this part of Question 4?

```
Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).
What is the problem if we try to design a binary code for it? Can we match the entropy lower bound on the number of bits?
```

What does it mean when we say `entropy lower bound`?

Hi, I was wondering, is there any standard answer to Questions 2 and 3? They are as follows:

Just under equation 3.4.4, the equation "\mathbf{o}^{(i)} = \mathbf{W}\mathbf{x}^{(i)} + \mathbf{b} where \hat{\mathbf{y}}^{(i)}" is listed.

Should b also have a superscript (i.e. if I understand correctly, there is a separate bias for each output neuron)? This appears to be the case in equation 3.4.2.

Hi @mlrocks, please check https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/information-theory.html#properties-of-entropy to see the lower bound’s meaning.
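To add a numerical illustration of the entropy lower bound for the (1/3, 1/3, 1/3) question: the entropy is log2(3) ≈ 1.585 bits, which is not an integer, so a binary code assigning a whole number of bits to each single symbol cannot match it. A minimal sketch (the code {0, 10, 11} is just one example of a valid prefix code):

```python
import math

# Entropy of the uniform three-class distribution (1/3, 1/3, 1/3), in bits.
p = [1 / 3, 1 / 3, 1 / 3]
entropy = -sum(pi * math.log2(pi) for pi in p)
print(entropy)  # log2(3) ~= 1.585 bits, not an integer

# Best we can do coding one symbol at a time is e.g. the prefix code
# {0, 10, 11}, whose expected length exceeds the entropy:
expected_length = (1 / 3) * 1 + (1 / 3) * 2 + (1 / 3) * 2
print(expected_length)  # 5/3 ~= 1.667 bits > 1.585 bits
```

(Coding long blocks of symbols lets the expected bits per symbol approach the entropy, per the source coding theorem.)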

In equation 3.4.5: if XW results in an n by q matrix and b is a vector of size q, how does it get added to the matrix? In the preliminaries section, the book states that "column vectors [are] the default orientation of vectors". Should I just assume that in this case the vector is a row vector and it gets added to each row of the XW matrix?

I apologize if it's a trivial question, but I want to make sure that I get this right.

Yes, your assumption is correct. We have \mathbf{X} \in \mathbb{R}^{n \times d}, \mathbf{W} \in \mathbb{R}^{d \times q}, and \mathbf{b} \in \mathbb{R}^{1 \times q}. When we perform \mathbf{X}\mathbf{W} + \mathbf{b} in NumPy, \mathbf{b} as a row vector is copied n times to get \mathbf{B} := [\mathbf{b}^\top, \ldots, \mathbf{b}^\top]^\top \in \mathbb{R}^{n \times q}.

See also Broadcasting Mechanism.
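A minimal sketch of that broadcast, with arbitrary toy shapes (n=4, d=5, q=3 are made up for illustration):

```python
import numpy as np

# Toy shapes: n examples, d input features, q output classes.
n, d, q = 4, 5, 3
rng = np.random.default_rng(0)
X = rng.random((n, d))
W = rng.random((d, q))
b = rng.random(q)  # row vector of shape (q,)

# NumPy broadcasts b across the n rows of XW:
O = X @ W + b

# Equivalent to explicitly tiling b into an (n, q) matrix B:
B = np.tile(b, (n, 1))
print(np.allclose(O, X @ W + B))  # True
print(O.shape)  # (4, 3)
```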