GluonCV pikachu demo killing Jupyter kernel on p3.2x


#1

I’m following this tutorial line by line: https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html on a SageMaker p3.2xlarge with the conda_mxnet36 kernel and gluoncv 0.3.0. The training loop kills the kernel with “The kernel appears to have died. It will restart automatically.” Does the pikachu demo require an especially large machine, or is something else wrong?
Cheers


#2

It also errors on p3.8xlarge and m5.24xlarge… It would be great to fix this, or to provide a configuration that makes the script work. It would be a nice demo to run and adapt to real use cases.


#3

No, the demo does not require an especially large machine. It seems to be a dependency problem: I checked on the SageMaker instance, and MXNet triggers a glibc failure, which kills the Python kernel:

 [19:22:28] src/operator/nn/mkldnn/mkldnn_base.cc:74: Allocate 9437184 bytes with malloc directly
*** Error in `python': free(): invalid pointer: 0x00007ff8e7fb3470 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7ff9a46747e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7ff9a467d37a]
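The abort above comes from glibc’s allocator (`free(): invalid pointer`), so when reporting this kind of crash it helps to include the instance’s glibc version. A quick check, nothing SageMaker-specific:

```shell
# Print the glibc version the instance is running
ldd --version | head -n 1
```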

A temporary fix: open a terminal on the SageMaker instance and run:

source activate mxnet_p36
pip install mxnet-cu90

Then the demo should run fine. I will contact the SageMaker team so that it can be properly fixed.


#4

I am currently debugging the core dump. The issue seems to be triggered by the DataLoader, more specifically by the transform, which performs an invalid access to shared memory. To avoid the core dump, you can modify the DataLoader so that it does not use SSDDefaultTrainTransform(width, height, anchors).
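A side note for anyone debugging a dying kernel like this: when the crash happens in native code, Jupyter shows no Python traceback at all. The standard-library faulthandler module can dump the Python stack at the moment of the fatal signal (SIGSEGV, SIGABRT, …), which helps confirm whether the DataLoader worker is the culprit. A minimal sketch, general debugging aid rather than anything specific to this tutorial:

```python
import faulthandler

# Dump the Python traceback of every thread if the process receives a
# fatal signal (SIGSEGV, SIGABRT, ...). Run this in a cell before the
# training loop; the traceback appears in the notebook/server log.
faulthandler.enable()
print(faulthandler.is_enabled())
```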


#5

https://github.com/apache/incubator-mxnet/issues/13448