We defined the network here from the inputs to hidden and then to output. My understanding is that while computing the gradients, first we compute for hidden layer and then input layer as in backpropagation algorithm. But here are we directly computing the loss between input layer and output layer and computing the gradients? Is this approach true? I am confused here.
Backpropagation is now just in theories. We study it because it was this algorithm that helped researchers to optimize networks in 1980s-1990s, this algorithm was simply based on chain rule of derivatives.
But nowadays we don’t use backpropagation, we calculate gradient of cost wrt parameters by using automatic differentiation. This new approach provides faster gradient calculation. In this approach we basically build a computation graph, and then calculate the gradients.
Alright! Last line makes sense. So instead of taking two layers at a time and computing their gradients backward, this will compute the entire network first using all existing parameters and then compute the gradients with respect to parameters accordingly, right?
Yes. Well as such I’d suggest you not to worry too much about it. Because calculating gradients is not that much important. We need gradients only to implement gradient descent algorithm, which made deep learning what it is today. But much important part is how do we forward propagate.