Using multiple gluon trainers with kvstore

Hi! I have two sets of parameters in the same model that I want to optimize with two different optimizers. How can I do this with two different gluon trainers on the same kvstore? At the moment this gives an error, since the kvstore uses plain integer indices as keys rather than unique ids, so the keys from the two trainers clash.
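
For reference, the kind of setup I have in mind looks roughly like this (layer names, regex patterns and hyperparameters are placeholders):

import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn

net = nn.Sequential()
with net.name_scope():
    net.add(nn.Dense(64, activation='relu'))  # parameters named ...dense0...
    net.add(nn.Dense(10))                     # parameters named ...dense1...
net.initialize()

# Two optimizers over disjoint parameter sets. Each Trainer enumerates its
# parameters as 0, 1, 2, ... when initializing its kvstore, so on a shared
# distributed kvstore the integer keys from the two trainers collide.
trainer_a = gluon.Trainer(net.collect_params('.*dense0.*'), 'sgd',
                          {'learning_rate': 0.02})
trainer_b = gluon.Trainer(net.collect_params('.*dense1.*'), 'adam',
                          {'learning_rate': 0.001})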

Why do you need the same kvstore, though? Why not one kvstore per trainer? Here is a working example on MNIST with two trainers, each using its own (default) kvstore:

from __future__ import print_function
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn
from mxnet import autograd as ag

# Fix the random seed for reproducibility
mx.random.seed(42)

mnist = mx.test_utils.get_mnist()

batch_size = 100
train_data = mx.io.NDArrayIter(mnist['train_data'], mnist['train_label'], batch_size, shuffle=True)
val_data = mx.io.NDArrayIter(mnist['test_data'], mnist['test_label'], batch_size)


# define network
net = nn.Sequential()
with net.name_scope():
    net.add(nn.Dense(128, activation='relu'))
    net.add(nn.Dense(64, activation='relu'))
    net.add(nn.Dense(10))


gpus = mx.test_utils.list_gpus()
ctx = [mx.gpu()] if gpus else [mx.cpu(0), mx.cpu(1)]
net.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)
# One trainer per parameter group: SGD for the two hidden layers (dense0 and
# dense1), Adam for the output layer (dense2). No kvstore is passed, so each
# Trainer creates its own and the integer keys cannot clash.
trainer_1 = gluon.Trainer(net.collect_params('.*dense0.*|.*dense1.*'), 'sgd', {'learning_rate': 0.02})
trainer_2 = gluon.Trainer(net.collect_params('.*dense2.*'), 'adam', {'learning_rate': 0.02})

epoch = 10
# Use Accuracy as the evaluation metric.
metric = mx.metric.Accuracy()
softmax_cross_entropy_loss = gluon.loss.SoftmaxCrossEntropyLoss()
for i in range(epoch):
    # Reset the train data iterator.
    train_data.reset()
    # Loop over the train data iterator.
    for batch in train_data:
        # Splits train data into multiple slices along batch_axis
        # and copy each slice into a context.
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=ctx, batch_axis=0)
        # Splits train labels into multiple slices along batch_axis
        # and copy each slice into a context.
        label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
        outputs = []
        # Inside training scope
        with ag.record():
            for x, y in zip(data, label):
                z = net(x)
                # Computes softmax cross entropy loss.
                loss = softmax_cross_entropy_loss(z, y)
                # Backpropagate the error for one iteration.
                loss.backward()
                outputs.append(z)
        # Updates internal evaluation
        metric.update(label, outputs)
        # Make one step of parameter update. Trainer needs to know the
        # batch size of data to normalize the gradient by 1/batch_size.
        trainer_1.step(batch.data[0].shape[0])
        trainer_2.step(batch.data[0].shape[0])
    # Gets the evaluation result.
    name, acc = metric.get()
    # Reset evaluation result to initial state.
    metric.reset()
    print('training acc at epoch %d: %s=%f'%(i, name, acc))
training acc at epoch 0: accuracy=0.886350
training acc at epoch 1: accuracy=0.947450
training acc at epoch 2: accuracy=0.961017
training acc at epoch 3: accuracy=0.969050
training acc at epoch 4: accuracy=0.974483
training acc at epoch 5: accuracy=0.978900
training acc at epoch 6: accuracy=0.981050
training acc at epoch 7: accuracy=0.984083
training acc at epoch 8: accuracy=0.986550
training acc at epoch 9: accuracy=0.988700
CPU times: user 30.2 s, sys: 27.5 s, total: 57.7 s
Wall time: 15.4 s
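
Note that I don't pass a kvstore argument above, so each Trainer creates its own (the default is 'device'). If you want to make that explicit, the equivalent would look roughly like this (same code as above, just spelling out the default):

trainer_1 = gluon.Trainer(net.collect_params('.*dense0.*|.*dense1.*'), 'sgd',
                          {'learning_rate': 0.02}, kvstore='device')
trainer_2 = gluon.Trainer(net.collect_params('.*dense2.*'), 'adam',
                          {'learning_rate': 0.02}, kvstore='device')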

Hi Thomas,

Thanks for your reply. I noticed that one kvstore per trainer only works on a single node, in ‘local’ or ‘device’ mode. When I set the kvstore to any of the ‘dist’ variants in a multi-node setting, the program just hangs. Any thoughts on why this happens?
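
Concretely, the variant that hangs is just your code above with an explicit distributed kvstore, run under a multi-node launcher such as tools/launch.py (the kvstore string and the launcher are the only changes; details simplified):

trainer_1 = gluon.Trainer(net.collect_params('.*dense0.*|.*dense1.*'), 'sgd',
                          {'learning_rate': 0.02}, kvstore='dist_sync')
trainer_2 = gluon.Trainer(net.collect_params('.*dense2.*'), 'adam',
                          {'learning_rate': 0.02}, kvstore='dist_sync')
# With a single trainer this trains fine; with two, the program hangs.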

Thanks,
Suhas


Any updates here? I ran into the same problem recently.