http://d2l.ai/chapter_multilayer-perceptrons/weight-decay.html
In train(lambd) as well as train_gluon(wd), animator.add(epoch + 1, …) should be changed to animator.add(epoch, …), because epoch already starts from 1 in the for loop.
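A minimal sketch of the off-by-one being reported (the loop bounds and the animator_add stand-in are hypothetical; the actual train(lambd) source may differ):

```python
num_epochs = 3
points = []

def animator_add(x, y):
    # Stand-in for animator.add: just record the x-coordinate plotted.
    points.append(x)

# Assuming the epoch counter starts from 1, as described above.
for epoch in range(1, num_epochs + 1):
    animator_add(epoch + 1, 0.0)  # as currently written

# The first epoch is plotted at x = 2 instead of x = 1, shifting the curve.
assert points == [2, 3, 4]
```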
In 4.5.1, the stochastic gradient descent update looks a bit strange: shouldn't the decay rate of w be controlled by \lambda alone? Why is the batch size involved here?
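For reference, the minibatch SGD update with weight decay that I believe the section states (symbols as in the book: learning rate \eta, minibatch \mathcal{B}, regularization constant \lambda) is:

```latex
\mathbf{w} \leftarrow (1 - \eta\lambda)\,\mathbf{w}
  - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}}
    \mathbf{x}^{(i)} \left( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \right)
```

If I am reading it correctly, the batch size |\mathcal{B}| only averages the gradient of the data loss, while the shrinkage factor (1 - \eta\lambda) applied to \mathbf{w} depends on \eta and \lambda alone; it would help to confirm whether the text intends |\mathcal{B}| to appear in the decay term at all.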