Does anyone have experience with implementing projected stochastic gradient descent in Gluon?
The use case I have is similar to training word embedding while imposing on word vectors to lie in the unit sphere. For small data size I can handle normalizing the entire embedding matrix rows every few iterations but this runs OOM when I need to run it on large datasets. It tried to load an embedding matrix of full size as auxiliary variable. Ideally when the sparse gradient step is taken the projection is performed on the active variables. I couldn’t find existing code for this.
Update: problem solved. I managed running the algorithm by writing the projected gradient steps in python using gradients computed inside of
autograd.record() and setting values of modified rows weights only manually inside the training loop. Note: the
set_data method does not allow to use sparse gradients efficiently, I had to implement this myself