https://d2l.ai/chapter_linear-networks/softmax-regression.html

# Softmax Regression

In the Log-Likelihood section and elsewhere, what does "n" represent?

"n" represents the number of observations, i.e., the number of training examples in the dataset.
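
If I remember that section right, $n$ appears when the negative log-likelihood is summed over the whole dataset:

$$-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$

so each of the $n$ examples contributes one cross-entropy term to the loss.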

Can somebody explain why $-\log P(y \mid \mathbf{x}) = -\sum_j y_j \log \hat{y}_j$?

Here is my humble understanding:

Note that $\hat{y}_j = P(y = j \mid \mathbf{x})$,

and $\mathbf{y}$ is a one-hot vector (all zeros except a single 1 at the true class $y$),

thus

$$-\sum_j y_j \log \hat{y}_j = -\log \hat{y}_y = -\log P(y \mid \mathbf{x}).$$

Sorry for my poor English.
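
A quick numeric check of that reduction, as a minimal sketch (the probability values and the true-class index are made up for illustration):

```python
import numpy as np

# Hypothetical predicted distribution over 3 classes (made-up values).
y_hat = np.array([0.1, 0.7, 0.2])

# One-hot label: in this example the true class is index 1.
y = np.array([0.0, 1.0, 0.0])

# Full cross-entropy sum over all classes.
full_sum = -np.sum(y * np.log(y_hat))

# Shortcut: because y is one-hot, only the true-class term survives.
shortcut = -np.log(y_hat[1])

print(full_sum, shortcut)  # both print 0.35667..., i.e. -log p(y|x)
```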

Can someone please explain this part of Question 4?

```
Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).
What is the problem if we try to design a binary code for it? Can we match the entropy lower bound on the number of bits?
```

What does it mean when we say `entropy lower bound`?
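
Not an official answer, but here is my understanding with a small sketch. The entropy $H(P) = -\sum_j p_j \log_2 p_j$ is the minimum expected number of bits per symbol that any code can achieve (Shannon's source coding theorem). For three equiprobable classes the entropy is $\log_2 3 \approx 1.585$ bits, but a binary prefix code must use whole-bit codeword lengths, so the best per-symbol code (e.g. codewords `0`, `10`, `11`) has an expected length of $5/3 \approx 1.667$ bits. That gap is the "problem": you can only approach the entropy lower bound by coding blocks of symbols together.

```python
import numpy as np

# Entropy of the uniform three-class distribution: the lower bound
# on the expected number of bits per symbol for any code.
p = np.array([1/3, 1/3, 1/3])
entropy = -np.sum(p * np.log2(p))
print(entropy)  # ~1.585 bits

# Best single-symbol binary prefix code for 3 symbols (e.g. Huffman):
# codewords "0", "10", "11" -> lengths 1, 2, 2.
lengths = np.array([1, 2, 2])
expected_length = np.sum(p * lengths)
print(expected_length)  # ~1.667 bits > entropy, so the bound is not met
```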