We're training a network for a recommender system on <user, item, score> triplets. The core code for the fit method is as follows:
for e in range(epochs):
    start = time.time()
    cumulative_loss = 0
    for i, batch in enumerate(train_iterator):
        # Forward + backward.
        with autograd.record():
            output = self.model(batch.data)
            loss = loss_fn(output, batch.label)
        # Calculate gradients.
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()
train_iterator is a custom iterator class that inherits from mx.io.DataIter and returns the data (<user, item, score> triples) already in the appropriate context, as:
labels = [mx.nd.array(data[:, -1], self.ctx)]
data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
return mx.io.DataBatch(data, labels)
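The iterator itself is essentially a thin wrapper over an in-memory array, along these lines (a simplified, hypothetical sketch; the class name is made up, and shuffling and the handling of the last partial batch are omitted):

import numpy as np
import mxnet as mx

class TriplesIter(mx.io.DataIter):
    """Sketch of the custom iterator: serves <user, item, score> rows
    from a preloaded numpy array, already placed in the target context."""

    def __init__(self, triples, batch_size, ctx):
        super(TriplesIter, self).__init__(batch_size)
        self.triples = triples            # numpy array of shape (N, 3)
        self.batch_size = batch_size
        self.ctx = ctx
        self.cursor = 0

    def reset(self):
        self.cursor = 0

    def next(self):
        if self.cursor >= len(self.triples):
            raise StopIteration
        rows = self.triples[self.cursor:self.cursor + self.batch_size]
        self.cursor += self.batch_size
        data = [mx.nd.array(rows[:, :-1], self.ctx, dtype=np.int)]
        labels = [mx.nd.array(rows[:, -1], self.ctx)]
        return mx.io.DataBatch(data, labels)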
self.model.initialize(ctx=mx.gpu(0)) was also called before running the fit loop, and the loss function is loss_fn = gluon.loss.L1Loss().
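For completeness, the surrounding setup looks roughly like this (the real network and the trainer_fn construction are not shown above, so the model, optimizer and learning rate below are placeholders):

import mxnet as mx
from mxnet import gluon

ctx = mx.gpu(0)

# Placeholder stand-in for self.model; the real network consumes <user, item> pairs.
model = gluon.nn.Sequential()
model.add(gluon.nn.Dense(64, activation='relu'))
model.add(gluon.nn.Dense(1))

model.initialize(ctx=ctx)            # parameters are created on the GPU
loss_fn = gluon.loss.L1Loss()        # L1 loss on the predicted score

# Assumed trainer_fn construction; the actual optimizer and learning rate may differ.
trainer_fn = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 0.001})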
The trouble is that nvidia-smi reports that the process is correctly allocated on the GPU. However, running fit on the GPU is not much faster than running it on the CPU. In addition, increasing batch_size from 50,000 to 500,000 increases the time per batch by a factor of 10, which I was not expecting given GPU parallelization.
Specifically, for a 50k batch:
output = self.model(batch.data) takes 0.03 seconds on the GPU, and 0.08 on the CPU.
loss.backward() takes 0.11 seconds on the GPU, and 0.39 on the CPU.
Both were measured with nd.waitall() to avoid asynchronous calls leading to incorrect measurements.
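The measurements were taken inline in the batch loop, roughly like this (a sketch of the timing code, not the exact snippet; nd.waitall() blocks until all queued asynchronous operations have finished, so the elapsed time reflects the actual computation rather than just the call overhead):

import time
from mxnet import autograd, nd

# Inside the batch loop; self.model, loss_fn and batch are as above.
nd.waitall()                       # drain any pending async work first
start = time.time()
with autograd.record():
    output = self.model(batch.data)
    loss = loss_fn(output, batch.label)
nd.waitall()                       # force the forward pass to complete
print('forward:  {:.3f}s'.format(time.time() - start))

start = time.time()
loss.backward()
nd.waitall()                       # force the backward pass to complete
print('backward: {:.3f}s'.format(time.time() - start))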
In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, so a full epoch takes slightly above one minute with plain MXNet versus up to 15 minutes with Gluon.
Any ideas on what might be happening here?
Thanks in advance!