cudaMalloc failed: out of memory when running the train_ssd.py example

Hi all,

win10 + mxnet 1.4.1 + cuda9.0 + rtx2080ti(11gb)

I am running the 'train_ssd.py' example script and hit a cudaMalloc failed: out of memory error with batch sizes of both 32 and 24. The error actually happens when I try to save the model's parameters. Is it possible to free up GPU memory before saving the params?

Below are the error messages:

INFO:root:<Mock id='1652998530440'>
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0][Batch 99], Speed: 60.858 samples/sec, CrossEntropy=6.831, SmoothL1=3.008
INFO:root:[Epoch 0][Batch 199], Speed: 54.608 samples/sec, CrossEntropy=5.900, SmoothL1=2.742
INFO:root:[Epoch 0][Batch 299], Speed: 59.945 samples/sec, CrossEntropy=5.513, SmoothL1=2.594
INFO:root:[Epoch 0][Batch 399], Speed: 58.295 samples/sec, CrossEntropy=5.285, SmoothL1=2.495
INFO:root:[Epoch 0][Batch 499], Speed: 55.049 samples/sec, CrossEntropy=5.112, SmoothL1=2.424
INFO:root:[Epoch 0][Batch 599], Speed: 59.277 samples/sec, CrossEntropy=4.978, SmoothL1=2.363
INFO:root:[Epoch 0] Training cost: 296.181, CrossEntropy=4.875, SmoothL1=2.326
INFO:root:[Epoch 0] Validation:
aeroplane=0.28163309987005836
bicycle=0.08652106730815054
bird=0.05896344456695527
boat=0.014805535671677405
bottle=0.09258335708178969
bus=0.04532720834989938
car=0.527245343705956
cat=0.37748727525376286
chair=0.08751348968999263
cow=0.17627388929086507
diningtable=0.018181818181818184
dog=0.23678240474526713
horse=0.11705849228006736
motorbike=0.12771836007130127
person=0.4574926315666286
pottedplant=0.011363636363636364
sheep=0.15096558584962233
sofa=0.15207774118599662
train=0.12716020303255735
tvmonitor=0.11322085870377328
mAP=0.16301877213848878
C:\Users\kuent\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\gluon\block.py:345: UserWarning: save_params is deprecated. Please use save_parameters. Note that if you want load from SymbolBlock later, please use export instead. For details, see https://mxnet.incubator.apache.org/tutorials/gluon/save_load_params.html
  warnings.warn("save_params is deprecated. Please use save_parameters. "
---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-78-386fd257e082> in <module>
----> 1 train(net, train_data, val_data, eval_metric, ctx, args)

<ipython-input-68-b8f160eb5553> in train(net, train_data, val_data, eval_metric, ctx, args)
     51                     box_preds.append(box_pred)
     52                 sum_loss, cls_loss, box_loss = mbox_loss(
---> 53                     cls_preds, box_preds, cls_targets, box_targets)
     54                 autograd.backward(sum_loss)
     55             # since we have already normalized the loss, we don't want to normalize

~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\gluon\block.py in __call__(self, *args)
    538             hook(self, args)
    539
--> 540         out = self.forward(*args)
    541
    542         for hook in self._forward_hooks.values():

~\Anaconda3\envs\mxnet36\lib\site-packages\gluoncv\loss.py in forward(self, cls_pred, box_pred, cls_target, box_target)
    138             pos_samples = (ct > 0)
    139             num_pos.append(pos_samples.sum())
--> 140         num_pos_all = sum([p.asscalar() for p in num_pos])
    141         if num_pos_all < 1 and self._min_hard_negatives < 1:
    142             # no positive samples and no hard negatives, return dummy losses

~\Anaconda3\envs\mxnet36\lib\site-packages\gluoncv\loss.py in <listcomp>(.0)
    138             pos_samples = (ct > 0)
    139             num_pos.append(pos_samples.sum())
--> 140         num_pos_all = sum([p.asscalar() for p in num_pos])
    141         if num_pos_all < 1 and self._min_hard_negatives < 1:
    142             # no positive samples and no hard negatives, return dummy losses

~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\ndarray\ndarray.py in asscalar(self)
   1996         if self.shape != (1,):
   1997             raise ValueError("The current array is not a scalar")
-> 1998         return self.asnumpy()[0]
   1999
   2000     def astype(self, dtype, copy=True):

~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\ndarray\ndarray.py in asnumpy(self)
   1978             self.handle,
   1979             data.ctypes.data_as(ctypes.c_void_p),
-> 1980             ctypes.c_size_t(data.size)))
   1981         return data
   1982

~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\base.py in check_call(ret)
    250     """
    251     if ret != 0:
--> 252         raise MXNetError(py_str(_LIB.MXGetLastError()))
    253
    254

MXNetError: [10:28:35] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:151: cudaMalloc failed: out of memory

Based on your code sample, though, this seems to happen when the next training loop starts. Can you confirm that the parameters are not saved?
Could you maybe try:

net.collect_params().reset_ctx(mx.cpu())    # move the parameters to CPU memory first
net.save_parameters('my_parameters.params') # serialize from host memory
net.collect_params().reset_ctx(ctx)         # move them back to the GPU context for training
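
For completeness, here is a minimal, self-contained sketch of the same idea wrapped in a helper (the function name and filename are just illustrative):

import mxnet as mx

def save_params_on_cpu(net, filename, ctx):
    """Save parameters without allocating any GPU memory for the save step."""
    net.collect_params().reset_ctx(mx.cpu())   # move weights off the GPU
    net.save_parameters(filename)              # write them from host memory
    net.collect_params().reset_ctx(ctx)        # move them back for training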

My GPU memory utilization on this script was ~12.5GB (with default settings, so batch size of 32). Are you having any luck with smaller batch sizes (less than 24)?

The parameters are actually saved, so it is able to complete one epoch. But then why does it need to allocate additional memory on the GPU?

Yes, I need to set batch_size to 16 to be able to train the model for 10 epochs. Memory usage is around 10 GB. It looks like you need at least 13 GB of GPU memory to run this script with default settings.

A couple of ideas for why GPU memory utilization might increase after the first epoch… could this be the first time network validation is run? That might need additional memory for post-processing and metrics. Another slightly strange step I see in the script is that it hybridizes the model on each epoch, which might be responsible for the increased memory usage.
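
If the per-epoch hybridize call is the culprit, one thing worth trying (a rough sketch, not tested against this exact script; the model name is the default used by the GluonCV example) is to hybridize once, before the epoch loop, with static allocation enabled:

import mxnet as mx
from gluoncv import model_zoo

ctx = mx.gpu(0)

# Same default model as the script; pretrained_base=False just keeps the
# sketch self-contained (no weight download needed).
net = model_zoo.get_model('ssd_300_vgg16_atrous_voc', pretrained_base=False)
net.initialize()
net.collect_params().reset_ctx(ctx)

# Hybridize a single time, outside the epoch loop. static_alloc/static_shape
# let the cached graph reuse one memory pool across iterations instead of
# re-planning allocations each time hybridize() is called per epoch.
net.hybridize(static_alloc=True, static_shape=True)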

Sorry for the late reply.

Another slightly strange step I see in the script is that it hybridizes the model on each epoch

I actually turned off hybridization and still got the memory error, which is quite strange. But with multiple GPUs the script should be fine with default settings. Let me try that and report back.
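
In case it helps, here is a minimal sketch of how the batch gets split across cards with Gluon (the device IDs and input shape are just illustrative, and it assumes two visible GPUs):

import mxnet as mx
from mxnet import gluon

ctx = [mx.gpu(0), mx.gpu(1)]   # one context per card

# A dummy batch with the default SSD 300x300 input size, just to show the split.
data = mx.nd.random.uniform(shape=(32, 3, 300, 300))

# split_and_load slices the batch along axis 0 and copies each slice to its
# own GPU, so each card only holds part of the activations and gradients.
data_list = gluon.utils.split_and_load(data, ctx_list=ctx, batch_axis=0)
print([d.context for d in data_list])   # [gpu(0), gpu(1)]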