Just some changes and (maybe) some corrections on the Gradients section:

  • I would state explicitly that these formulas follow the Denominator Layout

  • In the second and third examples (\mathbf{x}^\top \mathbf{A} and \mathbf{x}^\top \mathbf{A} \mathbf{x}), I think \mathbf{A} is assumed to have n rows rather than m as stated; neither product is defined if it has m rows (in fact the third one requires a square matrix), so this is either confusing or just a mistake.
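To spell out the shape constraint (my own sketch of the standard denominator-layout results, not quoted from the book):

```latex
% With x \in R^n, the product x^T A requires A to have n rows:
\nabla_{\mathbf{x}} \left( \mathbf{x}^\top \mathbf{A} \right) = \mathbf{A},
  \qquad \mathbf{A} \in \mathbb{R}^{n \times m}
% The quadratic form additionally requires A to be square:
\nabla_{\mathbf{x}} \left( \mathbf{x}^\top \mathbf{A} \mathbf{x} \right)
  = \left( \mathbf{A} + \mathbf{A}^\top \right) \mathbf{x},
  \qquad \mathbf{A} \in \mathbb{R}^{n \times n}
```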

Thank you for the effort!


Great catch! I guess you are right. It should be like the following:

I suggest that the Numerator Layout should be used here for consistency, as the next chapter mentions the Jacobian (an m by n matrix), which confused me for quite a while.



You are right; for me the most important thing is to stay consistent. The explanation in the Automatic Differentiation section, "… the gradient of y (a vector of length m) with respect to x (a vector of length n) is the Jacobian (an m \times n matrix)", is not consistent with the previous ones: that is the Numerator Layout (the Jacobian formulation), while the Calculus section uses the Denominator Layout.


@gpolo @minhduc0711

Thanks. The formula in the Calculus section follows Denominator layout. It’s quite common in deep learning: when you differentiate a loss function (scalar) with respect to a tensor, the shape of the differentiation result is the same as that of the tensor in denominator layout.
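To illustrate that shape convention (a toy sketch of my own; the loss function and names below are made up, not from the book), a numerical gradient of a scalar loss with respect to a matrix has the same shape as the matrix:

```python
import numpy as np

# Toy scalar loss of a matrix parameter: L(W) = sum of squared entries
def loss(W):
    return np.sum(W ** 2)

W = np.arange(6.0).reshape(2, 3)

# Finite-difference gradient, one entry at a time
eps = 1e-6
grad = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    E = np.zeros_like(W)
    E[idx] = eps
    grad[idx] = (loss(W + E) - loss(W - E)) / (2 * eps)

# In denominator layout the gradient inherits the shape of W
print(grad.shape)  # (2, 3)
```

Here the gradient is a 2-by-3 array, matching W, exactly as the denominator-layout convention prescribes.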

I agree that consistency matters. Thus I just removed the Jacobian description (in Numerator layout) in the automatic differentiation section:

Just let us know if you feel more explanations are needed. Thanks.

Now it is consistent. I do not think the explanation of the Jacobian there was needed, so now there is no inconsistency.
Thank you very much for your effort!

Does anyone have a solution for the 3rd question?

Hi @naveen_kumar,

Since \|\mathbf{x} \|_2 = (x_1^2 + x_2^2 + ...)^\frac{1}{2},
then \nabla_{\mathbf{x}} \|\mathbf{x} \|_2 = \frac{\mathbf{x}}{\|\mathbf{x} \|_2}.

You can try computing the partial derivative with respect to each x_i and concatenating these i^{th} entries together.
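A quick numerical sanity check of this identity (a sketch using NumPy finite differences; the helper name `l2_norm` is mine, not from the book):

```python
import numpy as np

def l2_norm(x):
    # ||x||_2 = (x_1^2 + x_2^2 + ...)^(1/2)
    return np.sqrt(np.sum(x ** 2))

x = np.array([3.0, 4.0])

# Closed-form gradient from the identity above: x / ||x||_2
grad = x / l2_norm(x)

# Central finite-difference estimate of each partial derivative
eps = 1e-6
numeric = np.array([
    (l2_norm(x + eps * np.eye(len(x))[i]) - l2_norm(x - eps * np.eye(len(x))[i])) / (2 * eps)
    for i in range(len(x))
])

print(grad)     # [0.6 0.8]
print(numeric)  # approximately [0.6 0.8]
```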


Warning: spoiler may contain the answer. I hope asking these here is ok; I have no other way of validating my attempts.

Is the answer \nabla_{\mathbf{x}} \|\mathbf{x} \|_2 for question 3 because with

\nabla=\left[ \dfrac{x_{1}}{\sqrt{x_{1}^2+x_{2}^2+\cdots+x_{n}^2}},\dfrac{x_{2}}{\sqrt{x_{1}^2+x_{2}^2+\cdots+x_{n}^2}},\ldots,\dfrac{x_{n}}{\sqrt{x_{1}^2+x_{2}^2+\cdots+x_{n}^2}}\right] the x's in the numerator can be represented as a vector \textbf{x} and the denominator is the definition of the l2 norm?

Also I would like to check my question for 4:

\dfrac{\partial u}{\partial a}=\dfrac{\partial u}{\partial x} \cdot \dfrac{\partial x}{\partial a}+\dfrac{\partial u}{\partial y} \cdot \dfrac{\partial y}{\partial a}+\dfrac{\partial u}{\partial z} \cdot \dfrac{\partial z}{\partial a}
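One way to check this expansion yourself (a sketch with made-up example functions, not the book's; here u = x + y*z with x = a + b, y = a*b, z = a - b) is to compare the chain-rule sum against a finite-difference estimate:

```python
# Hypothetical intermediates: x = a + b, y = a*b, z = a - b, and u = x + y*z
def u_of(a, b):
    x, y, z = a + b, a * b, a - b
    return x + y * z

a, b = 2.0, 3.0
x, y, z = a + b, a * b, a - b

# Chain rule: du/dx=1, du/dy=z, du/dz=y; dx/da=1, dy/da=b, dz/da=1
du_da_chain = 1.0 * 1.0 + z * b + y * 1.0

# Central finite-difference estimate of du/da
eps = 1e-6
du_da_num = (u_of(a + eps, b) - u_of(a - eps, b)) / (2 * eps)

print(du_da_chain)  # 4.0
print(du_da_num)    # approximately 4.0
```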

If my answer for 4 is correct, however, I would like to provide feedback. I am very much a novice when it comes to calculus and most of this material. If I had not watched a video on how to use the chain rule for multivariable calculus, I would have had no clue how to proceed. I think mentioning the graph method for plotting out the chains could help others who were in my shoes.

Also, the chain rule example omits any use of the partial derivative operator, and technically doesn't require it for single-variable chains. But the text mentions "multivariate functions in deep learning", and the lack of seeing one confused me.

This book has been an invaluable resource for me and made things very accessible for someone with a weak maths background. Thank you so much to the authors and I hope my feedback can be of use.

Hi everyone, who has the solution to the 4th question?