Adagrad

https://d2l.ai/chapter_optimization/adagrad.html

Can we avoid accumulating s_t and instead just set s_t = g_t ⊙ g_t (the element-wise square of the current gradient) at each time step? Then the learning rate wouldn't decay.
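For reference, the Adagrad recursion from the linked chapter (up to notation; ε is a small positive constant for numerical stability):

$$
\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t \odot \mathbf{g}_t, \qquad
\mathbf{x}_t = \mathbf{x}_{t-1} - \frac{\eta}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t .
$$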

That would be equivalent to updating the weights by sign(g_t) at every time step.
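To see this in the chapter's notation (ignoring ε, and assuming g_t has no zero entries):

$$
-\frac{\eta}{\sqrt{\mathbf{g}_t \odot \mathbf{g}_t}} \odot \mathbf{g}_t
= -\eta \, \frac{\mathbf{g}_t}{|\mathbf{g}_t|}
= -\eta \, \mathrm{sign}(\mathbf{g}_t),
$$

where the division and sign are taken element-wise.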

Basically, the problem with your suggestion is that (\sum_i x_i)^2 is not the same as \sum_i x_i^2: accumulating the squared gradients over time is what keeps the magnitude history in the denominator, while using only the current step's square throws that information away and reduces the update to sign(g_t), as noted above.
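A small numerical sketch of the difference (numpy; the toy gradient stream is made up for illustration): with accumulation the per-coordinate step sizes shrink and reflect the gradient history, while the non-accumulated variant always takes a step of size η in every coordinate.

```python
import numpy as np

eta, eps = 0.1, 1e-6  # learning rate and the usual stabilizing constant
# Hypothetical stream of gradients for a 2-dimensional parameter.
grads = [np.array([0.5, -2.0]), np.array([0.1, -0.3]), np.array([1.5, 0.2])]

# Adagrad: s_t accumulates the element-wise squared gradients over time,
# so the effective learning rate eta / sqrt(s_t) decays per coordinate.
s = np.zeros(2)
for g in grads:
    s += g * g
    print("adagrad step:", -eta * g / np.sqrt(s + eps))

# Proposed variant: s_t = g_t * g_t at every step (no accumulation).
# Since g / sqrt(g * g) is +/-1 element-wise, each step is just -eta * sign(g_t).
for g in grads:
    s_now = g * g
    step = -eta * g / np.sqrt(s_now + eps)
    assert np.allclose(step, -eta * np.sign(g), atol=1e-4)
    print("variant step:", step)
```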