GluonCV pikachu demo killing Jupyter kernel on p3.2x


#1

I’m following this tutorial line by line: https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html on a SageMaker p3.2xlarge with the conda_mxnet36 kernel and gluoncv 0.3.0. The training loop kills the kernel with “The kernel appears to have died. It will restart automatically.” Does the pikachu demo require an especially large machine, or is something else wrong?
Cheers


#2

It also errors on p3.8xlarge and m5.24xlarge… It would be great to fix this, or to provide a configuration that makes the script work. It would be a nice demo to run and adapt to real use cases.


#3

No, the demo does not require an especially large machine. It seems to be a dependency problem: I checked on the SageMaker instance, and MXNet triggers a glibc failure, which kills the Python kernel:

 [19:22:28] src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 9437184 bytes with malloc directly
*** Error in `python': free(): invalid pointer: 0x00007ff8e7fb3470 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7ff9a46747e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7ff9a467d37a]
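The abort above comes from glibc’s allocator (`free(): invalid pointer`), so when reporting this kind of crash it helps to include the instance’s glibc version. A quick check, nothing SageMaker-specific:

```shell
# Print the glibc version the instance is running
ldd --version | head -n 1
```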

A temporary fix: open a terminal on the SageMaker instance and run:

source activate mxnet_p36
pip install mxnet-cu90

Then the demo should run fine. I will contact the SageMaker team so that it can be properly fixed.


#4

I am currently debugging the core dump. The issue seems to be triggered by the DataLoader, more specifically by the transform, which performs an invalid access to shared memory. To avoid the core dump, you can modify the DataLoader so that it does not use SSDDefaultTrainTransform(width, height, anchors).
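A side note for anyone debugging a dying kernel like this: when the crash happens in native code, Jupyter shows no Python traceback at all. The standard-library faulthandler module can dump the Python stack at the moment of the fatal signal (SIGSEGV, SIGABRT, …), which helps confirm whether the DataLoader worker is the culprit. A minimal sketch, general debugging aid rather than anything specific to this tutorial:

```python
import faulthandler

# Dump the Python traceback of every thread if the process receives a
# fatal signal (SIGSEGV, SIGABRT, ...). Run this in a cell before the
# training loop; the traceback appears in the notebook/server log.
faulthandler.enable()
print(faulthandler.is_enabled())
```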


#5

https://github.com/apache/incubator-mxnet/issues/13448