GluonCV on Jupyter: "The kernel appears to have died. It will restart automatically."


#1

Hi, I’m adapting this GluonCV demo https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html to another dataset.
Apart from using a different dataset, I changed the model to ssd_512_resnet50_v1_custom. The data loader does not seem to handle those changes: the snippet below kills the Jupyter kernel (“The kernel appears to have died. It will restart automatically.”) on both ml.m4.2xlarge and ml.p2.xlarge Amazon SageMaker instances. What could cause that?

import mxnet as mx
import gluoncv as gcv
from mxnet import autograd, gluon

def get_dataloader(net, train_dataset, data_shape, batch_size, num_workers):
    from gluoncv.data.batchify import Tuple, Stack, Pad
    from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
    width, height = data_shape, data_shape
    # use fake data to generate fixed anchors for target generation
    with autograd.train_mode():
        _, _, anchors = net(mx.nd.zeros((1, 3, height, width)))
    batchify_fn = Tuple(Stack(), Stack(), Stack())  # stack image, cls_targets, box_targets
    train_loader = gluon.data.DataLoader(
        train_dataset.transform(SSDDefaultTrainTransform(width, height, anchors)),
        batch_size, True, batchify_fn=batchify_fn, last_batch='rollover', num_workers=num_workers)
    return train_loader

train_data = get_dataloader(
    net=net,
    train_dataset=gcv.data.RecordFileDetection('train.rec'),
    data_shape=600,
    batch_size=4,
    num_workers=0)

for i in train_data:
    print(i)

#2

Hi friend, I noticed something wrong in your code. The line for i in train_data: should be replaced by for i in enumerate(train_data):
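
For reference, here is a small sketch of what enumerate does to a loop. A plain Python list stands in for train_data (a placeholder for illustration only; a real DataLoader yields batches of arrays). In this sketch enumerate only pairs each item with an index, so the loop variable becomes an (index, item) tuple while the items themselves are unchanged.

```python
# A plain list stands in for train_data here (hypothetical placeholder,
# not a real gluon DataLoader).
batches = ["batch_a", "batch_b", "batch_c"]

# Iterating directly yields each item as-is.
direct = [item for item in batches]

# Iterating through enumerate yields (index, item) pairs.
indexed = [pair for pair in enumerate(batches)]

print(direct)
print(indexed)
```

With enumerate, the loop would typically be written as for i, batch in enumerate(train_data): so that the index and the batch are unpacked separately.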


#3

I tried to reproduce the problem, but your code example works fine on my p2 instance. I assume there is an issue with your input data, train.rec. Would you mind sharing the file with me so that I can investigate why your script is failing?
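
While investigating, it may help to get more detail than the Jupyter "kernel died" message. The sketch below uses Python's standard faulthandler module to write a Python-level traceback to a file if the process receives a fatal signal such as a segmentation fault. The helper name enable_crash_log and the file name crash_trace.log are hypothetical, chosen for illustration, and whether the crash here is a segmentation fault is an assumption.

```python
import faulthandler


def enable_crash_log(path="crash_trace.log"):
    """Dump Python tracebacks to `path` if the process hits a fatal signal.

    `path` is a hypothetical file name chosen for this sketch.
    """
    log_file = open(path, "w")
    # all_threads=True dumps the traceback of every thread, not only the
    # thread that received the signal.
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this file open for as long as the handler is active.
    return log_file


crash_log = enable_crash_log()
```

Run this in the first notebook cell, then run the failing loop. After the kernel restarts, the log file should show which Python line was executing at the time of the crash, if the crash was signal-based.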


#4

The reason I could not reproduce your problem is that I was running a different MXNet version on my P2 instance. When running on a new instance, I encountered the same issue. A GitHub issue has been opened: https://github.com/apache/incubator-mxnet/issues/13448
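
Since the behaviour depended on the installed MXNet version, it can be useful to record package versions when comparing instances. The following is a minimal sketch using the standard importlib.metadata module (Python 3.8+). The helper name installed_versions is hypothetical, and the distribution names "mxnet" and "gluoncv" are assumptions: GPU builds of MXNet are often installed under a different distribution name, in which case the lookup reports the package as not installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are assumptions; adjust them to match the environment.
report = installed_versions(["mxnet", "gluoncv"])
for name, version in report.items():
    print(name, version if version is not None else "not installed")
```

Running this on both the working and the failing instance gives a quick side-by-side comparison to attach to a bug report.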