Obj Detection Tutorial - #4 - train_ssd.py AWS DL-AMI v19


#1

I’m trying to get the gluoncv object detection tutorial to run (and can’t). I get a memory corruption error every time. To make this repeatable, here is my exact setup.
The tutorial: gluon-cv.mxnet.io/build/examples_detection/train_ssd_voc.html#

I am using AWS DL-AMI v19 (latest) on a p3.2xlarge; CUDA 9.0.176.
source activate mxnet_p36
pip install gluoncv

mxnet_cu90mkl v1.3.0 ($ pip list | grep mxnet # shows 1.3.0.post0)
gluoncv v0.3.0
no changes at all to the DL-AMI EC2 instance
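
For anyone reproducing this, a quick way to confirm which versions the active environment actually loads is to import the packages and print their version attributes. This is only a minimal sketch: `report_versions` is a hypothetical helper name, not something from mxnet or gluoncv, and it assumes both packages expose `__version__` (it falls back to "unknown" if they don't).

```python
import importlib
import sys


def report_versions(module_names):
    """Return {module_name: version string, or None if the import fails}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except (ImportError, OSError):
            # ImportError: package not installed in this environment.
            # OSError: a shared library the package needs could not be loaded.
            versions[name] = None
            continue
        # Not every module defines __version__, so fall back to "unknown".
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in report_versions(["mxnet", "gluoncv"]).items():
        print(name, version if version is not None else "could not be imported")
```

Run it inside the activated environment (e.g. after `source activate mxnet_p36`) so it reports what that environment sees rather than the system Python.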

I prepped the VOC data per instructions
(mxnet_p36) …$ python train_ssd.py --num-workers 8 --gpus 0

[22:03:28] src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 147456 bytes with malloc directly
[22:03:28] src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 5760000 bytes with malloc directly
INFO:root:Namespace(batch_size=32, data_shape=300, dataset='voc', epochs=240, gpus='0', log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='160,200', momentum=0.9, network='vgg16_atrous', num_workers=4, resume='', save_interval=10, save_prefix='ssd_300_vgg16_atrous_voc', seed=233, start_epoch=0, val_interval=1, wd=0.0005)
INFO:root:Start training from [Epoch 0]
python: malloc.c:2394: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
python: malloc.c:2394: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.
*** Error in `python': malloc(): memory corruption: 0x00007fa688074c90 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fa70143f7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8213e)[0x7fa70144a13e]

I tried different parameters (including no parameters), and I tried this on multiple EC2 instances to make sure it was repeatable. I get the same result on every EC2 instance regardless of parameters.


#2

SOLVED - this was a version problem

Problems I hit with this train_ssd.py tutorial:

  • tried on a physical server - got a RecursionError
  • then tried DL-AMI v19 mxnet_p36 environment, got memory corruptions
  • now tried DL-AMI v19 python3 environment
    • upgraded to CUDA 9.2
    • installed mxnet-cu92 (v1.3.1) & gluoncv 0.4.0; got RecursionError
    • installed mxnet-cu92 --pre (v1.4.0b20181203), gluoncv --pre (v0.4.0b20181203) - SUCCESS

lessons:

  • don’t depend on the DL-AMI mxnet environment - know which versions you are running
  • if you update CUDA on the DL-AMI (e.g. to 9.2), you have to update LD_LIBRARY_PATH as well
  • keep mxnet & gluoncv in sync (e.g. latest versions of both - don’t mix)
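
On the LD_LIBRARY_PATH point, here is a rough sketch of how one might check whether the path still points at the old CUDA install after an upgrade. The helper names `cuda_entries` and `mentions_cuda_version` are made up for illustration, and the check assumes CUDA lives in a directory named like `cuda-9.2` (for example under `/usr/local`), which is a common layout but not guaranteed on every machine.

```python
import os


def cuda_entries(ld_library_path):
    """Return the LD_LIBRARY_PATH entries that mention a CUDA directory."""
    # LD_LIBRARY_PATH is a Linux variable, so entries are separated by ":".
    entries = [p for p in ld_library_path.split(":") if p]
    return [p for p in entries if "cuda" in p.lower()]


def mentions_cuda_version(ld_library_path, cuda_version):
    """True if any CUDA entry names the given version, e.g. 'cuda-9.2'.

    Assumption: the install directory is named 'cuda-<version>'.
    """
    marker = "cuda-" + cuda_version
    return any(marker in p for p in cuda_entries(ld_library_path))


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("CUDA entries:", cuda_entries(current))
    print("mentions cuda-9.2:", mentions_cuda_version(current, "9.2"))
```

If the first line still lists only a 9.0 directory after installing mxnet-cu92, the path is pointing at the old CUDA install and needs updating.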