Numerical Stability and Initialization

http://d2l.ai/chapter_multilayer-perceptrons/numerical-stability-and-init.html

$$E[h_i^2] = \sum_{j=1}^{n_\mathrm{in}} E[W^2_{ij} x^2_j]$$

I’ve found the second equation of 4.8.4 hard to understand. Say $h_i = w_1 x_1 + w_2 x_2$; then $E[h_i^2] = E[w_1^2 x_1^2 + w_2^2 x_2^2 + 2 w_1 x_1 w_2 x_2]$. Does the equation above simply drop the cross term $2 w_1 x_1 w_2 x_2$?
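Under the chapter's assumptions (the weights are independent of each other and of the inputs, and have zero mean), the cross term vanishes in expectation, so dropping it is justified:

$$E[2 w_1 x_1 w_2 x_2] = 2\,E[w_1]\,E[w_2]\,E[x_1 x_2] = 0.$$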

All clear, but one thing:
Why should Xavier initialization avoid problems like exploding/vanishing gradients? It seems to be just a method that initially gives the model parameters a variance matched to the number of inputs and outputs; I don't understand why and how this is related to the problems described…

It comes from how your parameters map your inputs/features onto the activation function.
The goal of this initialization is to keep the pre-activation (logit) at roughly zero mean and unit variance. If that variance shrinks or grows as the signal passes through many layers, so do the gradients flowing backward, which is exactly the vanishing/exploding-gradient problem.
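To see the connection concretely, here is a minimal NumPy sketch (the layer width, depth, and tanh activation are my own illustrative choices, not from the chapter). It pushes data through many layers and prints the variance of the last pre-activation, once with a fixed small weight scale and once with the Xavier scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = n_out = 256                     # illustrative layer width (assumption)
depth = 50                             # illustrative depth (assumption)
h = rng.standard_normal((1024, n_in))  # zero-mean, unit-variance inputs

def last_preactivation_var(scale):
    """Push data through `depth` tanh layers; return the variance of the final logit."""
    a = h
    for _ in range(depth):
        W = rng.standard_normal((n_in, n_out)) * scale
        z = a @ W                      # pre-activation ("logit")
        a = np.tanh(z)
    return z.var()

# Fixed small scale: the logit variance collapses layer by layer (vanishing signal).
print(last_preactivation_var(0.01))
# Xavier scale sqrt(2 / (n_in + n_out)): the logit variance stays at a healthy level.
print(last_preactivation_var(np.sqrt(2.0 / (n_in + n_out))))
```

With the fixed scale the variance (and with it the gradient signal on the backward pass) collapses towards zero, while the Xavier scale keeps it on a stable order of magnitude; that is how the initialization helps sidestep vanishing/exploding gradients.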

Please find a more detailed explanation in this answer https://www.quora.com/What-is-an-intuitive-explanation-of-the-Xavier-Initialization-for-Deep-Neural-Networks

Hope that this answers your question :slight_smile:

Very interesting and clear! Thank you :pray: :grin:
