Softmax Regression

https://d2l.ai/chapter_linear-networks/softmax-regression.html

In the topic Log-Likelihood and others, what does 'n' represent?

n represents the number of observations.

Can somebody explain how -\log p(y | x) = -\sum_j y_j \log(\hat{y}_j)?

Here is my humble understanding:

Note that \hat{y}_j = p(y = j | x), and that y is a one-hot vector (all entries are zero except the one for the true class). Thus

-\sum_j y_j \log(\hat{y}_j) = -\log(\hat{y}_y) = -\log p(y | x),

where \hat{y}_y denotes the predicted probability of the true label y.

Sorry for my poor English.
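
A minimal NumPy sketch of that argument (my own, with a made-up probability vector, not code from the book): with a one-hot label, the cross-entropy sum collapses to the negative log of the predicted probability of the true class.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])  # hypothetical predicted probabilities \hat{y}_j = p(y == j | x)
y = np.array([0.0, 1.0, 0.0])      # one-hot label: the true class is j = 1

cross_entropy = -np.sum(y * np.log(y_hat))  # -sum_j y_j * log(\hat{y}_j)
neg_log_prob = -np.log(y_hat[1])            # -log p(y | x), reading off the true class directly

print(cross_entropy, neg_log_prob)  # both are about 0.3567, i.e. -log(0.7)
```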

Can someone please explain this part in Question 4?

Assume that we have three classes which occur with equal probability, i.e., the probability vector is (1/3, 1/3, 1/3).
What is the problem if we try to design a binary code for it? Can we match the entropy lower bound on the number of bits?

What does it mean when we say entropy lower bound?
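
Not an official answer, just how I read it: the entropy of the distribution is the information-theoretic lower bound on the average number of bits any lossless code needs per symbol. For three equally likely classes,

H(P) = -\sum_j (1/3) \log_2(1/3) = \log_2 3 ≈ 1.585 bits.

A binary code must assign an integer number of bits to each codeword, so when coding one symbol at a time the best we can do is codeword lengths like 1, 2, 2, which average 5/3 ≈ 1.67 bits and therefore cannot match the 1.585-bit lower bound (we would have to encode blocks of symbols to get closer).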