Training on GPU not much faster than on CPU

Hello everybody,

We’re training a network for a recommender system on <user, item, score> triplets. The core code for the fit method is as follows:

for e in range(epochs):
    start = time.time()

    cumulative_loss = 0

    for i, batch in enumerate(train_iterator):
        # Forward pass, recorded so gradients can be computed.
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])

        # Calculate gradients
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()

where train_iterator is a custom iterator class that inherits from mx.io.DataIter and returns the data (<user, item, score> triplets) already placed in the appropriate context, as:

        data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
        labels = [mx.nd.array(data[:, -1], self.ctx)]
        return mx.io.DataBatch(data, labels)
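For context, the iterator as a whole is roughly along these lines (a simplified, minimal sketch rather than the actual class, which does more work per batch):

import numpy as np
import mxnet as mx

class TripletIter(mx.io.DataIter):
    # Illustrative iterator over in-memory <user, item, score> triplets.
    def __init__(self, triplets, batch_size, ctx):
        super(TripletIter, self).__init__(batch_size)
        self.triplets = triplets      # numpy array of shape (N, 3)
        self.ctx = ctx
        self.cursor = 0

    def reset(self):
        self.cursor = 0

    def next(self):
        if self.cursor >= len(self.triplets):
            raise StopIteration
        chunk = self.triplets[self.cursor:self.cursor + self.batch_size]
        self.cursor += self.batch_size
        # Copy the batch to the target context (e.g. mx.gpu(0)) up front.
        data = [mx.nd.array(chunk[:, :-1], self.ctx, dtype=np.int)]
        labels = [mx.nd.array(chunk[:, -1], self.ctx)]
        return mx.io.DataBatch(data, labels)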

self.model.initialize(ctx=mx.gpu(0)) was also called before running the fit method. loss_fn = gluon.loss.L1Loss().

The trouble is that nvidia-smi reports that the process is correctly allocated on the GPU; however, running fit on the GPU is not much faster than running it on the CPU. In addition, increasing batch_size from 50,000 to 500,000 increases the time per batch by a factor of 10, which I was not expecting, given GPU parallelization.

Specifically, for a 50k batch:

  • output = self.model(batch.data[0]) takes 0.03 seconds on GPU, and 0.08 on CPU.
  • loss.backward() takes 0.11 seconds on GPU, and 0.39 on CPU.

Both were measured after calling nd.waitall(), to keep asynchronous execution from producing misleading timings.
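For reference, each timing was taken with a pattern roughly like the following (a simplified sketch, not the exact instrumentation code):

import time
from mxnet import nd, autograd

start = time.time()
with autograd.record():
    output = self.model(batch.data[0])
    loss = loss_fn(output, batch.label[0])
nd.waitall()   # block until the asynchronous forward pass has actually finished
forward_time = time.time() - start

start = time.time()
loss.backward()
nd.waitall()   # same for the backward pass
backward_time = time.time() - start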

In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, which leads to a full epoch taking slightly over one minute with MXNet versus up to 15 minutes with Gluon.

Any ideas on what might be happening here?

Thanks in advance!

Although there can be many reasons, two that come to mind:

  1. It is a very small network, and therefore you should keep execution as asynchronous as possible in order to utilise the GPU. In the code above, the thing to improve would be to remove the asscalar(), since this is a blocking operation. So perhaps test it without that line and see whether there is a substantial improvement. If both your CPU load and your GPU load are low => this is a logical step to try.

  2. Your feature preparation is CPU-bound and you cannot feed data to the model quickly enough. A DataLoader would provide some additional scalability, since it can work with multiple workers (although, based on the snippet you provided, it doesn’t look like a very heavy feature preparation step); see the sketch after this list.
    So if your single-CPU load is high but your GPU load is low => this is a logical step to try.
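A rough sketch of what option 2 could look like (not tested; train_dataset, user_item_array and score_array are placeholders for however you actually build your triplets):

import mxnet as mx
from mxnet import gluon

# Wrap the <user, item> features and the scores in a Dataset.
train_dataset = gluon.data.ArrayDataset(user_item_array, score_array)

# num_workers > 0 moves batch preparation into separate worker processes,
# so the main process can keep feeding batches to the GPU.
train_loader = gluon.data.DataLoader(train_dataset, batch_size=50000,
                                     shuffle=True, num_workers=4)

for data, label in train_loader:
    data = data.as_in_context(mx.gpu(0))
    label = label.as_in_context(mx.gpu(0))
    # forward/backward/step as in your existing loop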

Thanks for taking the time to reply.

Indeed, the network is very small. It’s just a cosine product of 64-dimensional embeddings.

In further tests we have done, removing the cumulative_loss += nd.mean(loss).asscalar() line yields a huge improvement, but I suspect that only defers the computation of the .backward() step until whenever loss is actually used. Indeed, storing all the loss values in an array and computing the nd.mean() at the end means the batches run through quite quickly, but then the process spends quite some time in that final nd.mean() calculation at the end of the epoch.

When removing the asscalar(), however, and with a large number of batches, I end up running out of GPU memory, I assume because too many operations get enqueued. With a smaller number of batches there was some improvement in the overall time per epoch, but not close to the performance I obtained with the MXNet code.

The most important part of the feature preparation is that random negative samples are added, but that is managed via a queue inside the custom iterator, so that batches are enqueued and later dequeued. After a couple of optimizations in the data loading part, single-CPU usage is at 100%, and GPU load is close to 100% as well (at least as reported by nvidia-smi, under the Volatile GPU-Util column).

I’m not sure whether this means anything, but after replenishing the queue (which takes some 0.5 seconds), the next batch is processed much more quickly, in 0.003 seconds (vs. 0.11 seconds for “regular” batches, both with the asscalar()).

What I meant to say is: don’t invoke asscalar() unless you really need to, but do use mean() (or sum(), for that matter, if you really want the overall cumulative loss).

So the code could look something like this (not tried):

cumulative_loss += loss.mean().detach()
if (i % 500) == 0: print(cumulative_loss.asscalar())

I added the detach() to make sure you don’t keep a reference to the complete graph which might cause GPU memory problems.

Thanks for the clarification.

I just tried that (with the detach()) and updating the loss only every 500 iterations, taking up some 8GB of GPU memory. However, time per epoch is still ~900 seconds, whereas the MXNet implementation took some 90 seconds, with 295MB… We’re talking about an order of magnitude here.

There’s still something fishy that I’m missing. I read somewhere about loss.hybridize(). Do you think that might help? Also, self.model.hybridize() didn’t seem to provide much improvement, even though I thought it would.

Ok, so I finally got an important breakthrough.

The former MXNet code was implemented using mx.symbol.Embedding([...], sparse_grad=True). I enabled sparse gradients for the Embedding layers in the Gluon model, and the time per epoch dropped from 900 seconds to 120. Still not the 90 seconds of the MXNet implementation, but at least it’s no longer an order of magnitude away.
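In case it helps anyone else, the change boils down to something like this (a sketch; num_users and num_items are placeholders, the 64-dimensional size is from my setup):

from mxnet import gluon

# Gluon counterpart of mx.symbol.Embedding(..., sparse_grad=True):
# gradients of the embedding weights become row_sparse, so only the rows
# touched by the current batch get updated.
user_embedding = gluon.nn.Embedding(num_users, 64, sparse_grad=True)
item_embedding = gluon.nn.Embedding(num_items, 64, sparse_grad=True)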

Hi @GSanchis, I think you’ll find the MXNet Profiler useful for this. You can certainly hybridize the loss function, but usually this only helps if your loss function is composed of multiple operators. Another tip: when you hybridize the network, try setting static_alloc=True and static_shape=True.
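For example, something along these lines (assuming net is your model and loss_fn your loss function):

# Cache the computation graph with static memory allocation and fixed shapes;
# this mainly helps when the batch shape stays the same across iterations.
net.hybridize(static_alloc=True, static_shape=True)
loss_fn.hybridize()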


Try using mx.io.NDArrayIter(data, label, batch_size, shuffle); I guess it’s faster than mx.io.DataBatch.

PS: Sorry if it’s not.

@thomelane Thanks for the heads-up! I believe I had used the MXNet profiler in the past, but I seem to remember that back then you even had to compile MXNet with a special option, I think. That was some time ago, though. Regarding static_alloc = True and static_shape = True, I have already included those in the code, although they didn’t seem to bring a big improvement.

@mouryarishik But isn’t the purpose of NDArrayIter different from that of DataBatch? As far as I know, DataBatch is… a batch, whereas NDArrayIter is an iterator. Given that my iterator needs to return data that is not necessarily in the training data (i.e., negative samples), and that the amount of data might not actually fit into memory, I don’t think NDArrayIter would be an option.
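For clarity, the way I understand how the two relate (a toy sketch with made-up data):

import numpy as np
import mxnet as mx

data = np.random.rand(100, 2)
label = np.random.rand(100)

# NDArrayIter is an iterator over an in-memory dataset...
it = mx.io.NDArrayIter(data, label, batch_size=10, shuffle=True)

# ...and every element it yields is a DataBatch, which is what my
# custom iterator builds by hand for each batch.
for batch in it:
    print(type(batch))   # a DataBatch instance
    break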

@GSanchis You are correct. Sorry for the misunderstanding; I forgot what DataBatch does off the top of my head.

Yep, things have changed on that front. You don’t need to re-compile anymore. You can just import profiler and use as follows:

from mxnet import profiler


profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')

# Ask the profiler to start recording
profiler.set_state('run')

# code to profile goes here
# usually you only want a few batches worth

# Ask the profiler to stop recording after operations have completed
mx.nd.waitall()
profiler.set_state('stop')

# save to file
profiler.dump()

This statistic can sometimes be deceptive. It shows the percentage of time during which at least one kernel is running on the GPU (i.e. it could be just one kernel), so you might still have additional capacity on the GPU for parallel processing even if it reports 100%. Time spent on memcpy is also included in this statistic.

You might find the following videos useful too:
