Hi all,
Environment: Windows 10, MXNet 1.4.1, CUDA 9.0, RTX 2080 Ti (11 GB)
I am running the ‘train_ssd.py’ example script and hit a "cudaMalloc failed: out of memory" error with batch sizes of both 32 and 24. The error seems to occur when I try to save the model's parameters, after the first epoch's validation. Is it possible to free GPU memory before saving the params?
Below are the error messages:
INFO:root:<Mock id='1652998530440'>
INFO:root:Start training from [Epoch 0]
INFO:root:[Epoch 0][Batch 99], Speed: 60.858 samples/sec, CrossEntropy=6.831, SmoothL1=3.008
INFO:root:[Epoch 0][Batch 199], Speed: 54.608 samples/sec, CrossEntropy=5.900, SmoothL1=2.742
INFO:root:[Epoch 0][Batch 299], Speed: 59.945 samples/sec, CrossEntropy=5.513, SmoothL1=2.594
INFO:root:[Epoch 0][Batch 399], Speed: 58.295 samples/sec, CrossEntropy=5.285, SmoothL1=2.495
INFO:root:[Epoch 0][Batch 499], Speed: 55.049 samples/sec, CrossEntropy=5.112, SmoothL1=2.424
INFO:root:[Epoch 0][Batch 599], Speed: 59.277 samples/sec, CrossEntropy=4.978, SmoothL1=2.363
INFO:root:[Epoch 0] Training cost: 296.181, CrossEntropy=4.875, SmoothL1=2.326
INFO:root:[Epoch 0] Validation:
aeroplane=0.28163309987005836
bicycle=0.08652106730815054
bird=0.05896344456695527
boat=0.014805535671677405
bottle=0.09258335708178969
bus=0.04532720834989938
car=0.527245343705956
cat=0.37748727525376286
chair=0.08751348968999263
cow=0.17627388929086507
diningtable=0.018181818181818184
dog=0.23678240474526713
horse=0.11705849228006736
motorbike=0.12771836007130127
person=0.4574926315666286
pottedplant=0.011363636363636364
sheep=0.15096558584962233
sofa=0.15207774118599662
train=0.12716020303255735
tvmonitor=0.11322085870377328
mAP=0.16301877213848878
C:\Users\kuent\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\gluon\block.py:345: UserWarning: save_params is deprecated. Please use save_parameters. Note that if you want load from SymbolBlock later, please use export instead. For details, see https://mxnet.incubator.apache.org/tutorials/gluon/save_load_params.html
warnings.warn("save_params is deprecated. Please use save_parameters. "
---------------------------------------------------------------------------
MXNetError Traceback (most recent call last)
<ipython-input-78-386fd257e082> in <module>
----> 1 train(net, train_data, val_data, eval_metric, ctx, args)
<ipython-input-68-b8f160eb5553> in train(net, train_data, val_data, eval_metric, ctx, args)
51 box_preds.append(box_pred)
52 sum_loss, cls_loss, box_loss = mbox_loss(
---> 53 cls_preds, box_preds, cls_targets, box_targets)
54 autograd.backward(sum_loss)
55 # since we have already normalized the loss, we don't want to normalize
~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\gluon\block.py in __call__(self, *args)
538 hook(self, args)
539
--> 540 out = self.forward(*args)
541
542 for hook in self._forward_hooks.values():
~\Anaconda3\envs\mxnet36\lib\site-packages\gluoncv\loss.py in forward(self, cls_pred, box_pred, cls_target, box_target)
138 pos_samples = (ct > 0)
139 num_pos.append(pos_samples.sum())
--> 140 num_pos_all = sum([p.asscalar() for p in num_pos])
141 if num_pos_all < 1 and self._min_hard_negatives < 1:
142 # no positive samples and no hard negatives, return dummy losses
~\Anaconda3\envs\mxnet36\lib\site-packages\gluoncv\loss.py in <listcomp>(.0)
138 pos_samples = (ct > 0)
139 num_pos.append(pos_samples.sum())
--> 140 num_pos_all = sum([p.asscalar() for p in num_pos])
141 if num_pos_all < 1 and self._min_hard_negatives < 1:
142 # no positive samples and no hard negatives, return dummy losses
~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\ndarray\ndarray.py in asscalar(self)
1996 if self.shape != (1,):
1997 raise ValueError("The current array is not a scalar")
-> 1998 return self.asnumpy()[0]
1999
2000 def astype(self, dtype, copy=True):
~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\ndarray\ndarray.py in asnumpy(self)
1978 self.handle,
1979 data.ctypes.data_as(ctypes.c_void_p),
-> 1980 ctypes.c_size_t(data.size)))
1981 return data
1982
~\Anaconda3\envs\mxnet36\lib\site-packages\mxnet\base.py in check_call(ret)
250 """
251 if ret != 0:
--> 252 raise MXNetError(py_str(_LIB.MXGetLastError()))
253
254
MXNetError: [10:28:35] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:151: cudaMalloc failed: out of memory
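
A note on reading the traceback above: the exception is raised from `asscalar()` inside the loss forward pass, not from the save call itself. MXNet queues GPU work asynchronously, so a failed allocation can surface at the next point where Python blocks on a result. One way to narrow this down is to wait for all queued work to finish right before saving, so that any pending failure is reported at the save step rather than in a later batch. The sketch below is only an illustration under that assumption: `save_checkpoint` and its `sync` parameter are hypothetical names, not part of `train_ssd.py`. It relies on `mx.nd.waitall()` and on `save_parameters`, which the deprecation warning in the log recommends over `save_params`. It does not free any memory by itself.

```python
def save_checkpoint(net, path, sync=None):
    """Wait for queued work to finish, then write the parameters to `path`.

    `save_checkpoint` and `sync` are hypothetical names used for illustration.
    """
    if sync is None:
        # Imported lazily so the helper can be exercised without a GPU build.
        import mxnet as mx
        sync = mx.nd.waitall  # blocks until all pending async operations finish

    sync()
    # save_parameters is the replacement named in the deprecation warning.
    net.save_parameters(path)
    return path
```

In the training loop this would be called once per epoch after validation, for example `save_checkpoint(net, 'ssd_epoch0.params')` (the file name is a placeholder). If the out-of-memory error then appears at the `sync()` call, the allocation failed earlier, during validation or training, and a smaller batch size is probably still needed.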