Just some changes and (maybe) some corrections on the Gradients section:
I would state explicitly that these formulas follow the Denominator Layout.
For the second and third examples (x_transpose * A and x_transpose * A * x), I think A is assumed to have n rows instead of m, as is stated. Neither of the two is possible if A has m rows (in fact, the third one requires a square matrix), so this is either confusing or just a mistake.
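To make the shape argument concrete, here is a minimal sketch that only tracks matrix shapes. The helper names `matmul_shape` and `conforms` and the sizes n = 3, m = 5 are my own choices for illustration, not anything from the book.

```python
def matmul_shape(a, b):
    """Shape of the product of two matrices with shapes a and b, or ValueError."""
    (rows_a, cols_a), (rows_b, cols_b) = a, b
    if cols_a != rows_b:
        raise ValueError(f"cannot multiply {a} by {b}")
    return (rows_a, cols_b)


def conforms(a, b):
    """True if a matrix of shape a can be multiplied by one of shape b."""
    try:
        matmul_shape(a, b)
        return True
    except ValueError:
        return False


n, m = 3, 5      # hypothetical sizes, chosen so that n != m
x = (n, 1)       # x is a column vector of length n
x_T = (1, n)     # so x_transpose is 1 x n

# x_transpose * A works when A has n rows ...
xTA_n_rows = matmul_shape(x_T, (n, m))       # (1, m)
# ... but not when A has m rows.
xTA_m_rows_ok = conforms(x_T, (m, n))        # False

# x_transpose * A * x needs A to be n x n.
quad_square = matmul_shape(matmul_shape(x_T, (n, n)), x)   # (1, 1)
quad_rect_ok = conforms(matmul_shape(x_T, (n, m)), x)      # False
```

With n != m, only the n-row (and, for the quadratic form, square) versions of A go through.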
Thank you for the effort!
I suggest using the Numerator Layout here for consistency, as the next chapter mentions the Jacobian (an m by n matrix), which confused me for quite a while.
You are right; for me the most important thing is to stay consistent. The explanation in the Automatic Differentiation section, "… the gradient of y (a vector of length m) with respect to x (a vector of length n) is the Jacobian (an m \times n matrix)", is not consistent with the previous formulas: it uses the Numerator Layout (Jacobian formulation), whereas the Calculus section uses the Denominator Layout.
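A small sketch of the difference between the two layouts, assuming the linear map y = A x with A of size m x n. It uses plain Python and central finite differences rather than any autodiff library, and the helper names and the example values are made up for illustration.

```python
def matvec(A, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def jacobian_numerator(f, x, eps=1e-6):
    """Numerator layout: entry [i][j] approximates d f_i / d x_j (m x n)."""
    columns = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + eps] + x[j + 1:]
        down = x[:j] + [x[j] - eps] + x[j + 1:]
        columns.append([(u - d) / (2 * eps) for u, d in zip(f(up), f(down))])
    # columns[j][i] holds d f_i / d x_j; transpose so rows are indexed by i.
    return [list(row) for row in zip(*columns)]


A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # m = 2, n = 3 (hypothetical values)
x = [0.5, -1.0, 2.0]

J = jacobian_numerator(lambda v: matvec(A, v), x)   # m x n, approximately A
G = [list(row) for row in zip(*J)]                  # n x m, approximately A transposed

shape_J = (len(J), len(J[0]))
shape_G = (len(G), len(G[0]))
```

Here `J` is the m x n Jacobian of the numerator layout, and its transpose `G` is the n x m result that the denominator layout would report for the same derivative.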
Thanks. The formulas in the Calculus section follow the Denominator Layout. It’s quite common in deep learning: when you differentiate a loss function (a scalar) with respect to a tensor, the result in the Denominator Layout has the same shape as that tensor.
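As a rough illustration of the "same shape as the tensor" point, the sketch below differentiates the scalar x_transpose * A * x numerically and compares it with (A + A_transpose) * x, the standard closed form for the gradient of that quadratic form. The function names and the 2 x 2 example are assumptions for illustration only.

```python
def quad(A, x):
    """Scalar value of x_transpose * A * x for a square list-of-rows A."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function; one entry per entry of x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


A = [[1.0, 2.0],
     [3.0, 4.0]]               # hypothetical square matrix
x = [1.0, -1.0]
n = len(x)

g_num = numerical_grad(lambda v: quad(A, v), x)
# Closed form (A + A_transpose) * x, written out entry by entry.
g_formula = [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]
```

The gradient `g_num` has exactly as many entries as `x`, which is the shape convention described above, and it should agree with `g_formula` up to finite-difference error.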
I agree that consistency matters. Thus I just removed the Jacobian description (in Numerator layout) in the automatic differentiation section:
Just let us know if you feel more explanations are needed. Thanks.
Now it is consistent. I do not think the explanation of the Jacobian there was needed, and now there is no inconsistency.
Thank you very much for your effort!