Example SSD cannot load model if trained with resnet50

ifeherva · October 16, 2018, 5:31pm

I trained an SSD model via SageMaker, which AFAIK uses the code from https://github.com/apache/incubator-mxnet/tree/master/example/ssd.

After training, the model is expected to be converted to a “deployable” state by running the deploy.py script, which removes the loss symbols. Afterwards, I load the model with the following code:

sym, arg_params, aux_params = mx.model.load_checkpoint('deploy_ssd_vgg16_reduced_512', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1,3,512,512))],
         label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_extra=True)

This works fine as long as the model is trained with a VGG feature extractor. However, SageMaker (and hence the example code) also allows training with resnet50. That produces a model that can be converted with deploy.py, but the converted model can no longer be loaded with the above code. The error I am getting is:

RuntimeError: _plus12_cls_pred_conv_bias is not presented

And indeed, the BN params and a few others are missing from the param file. Maybe the deploy script is bugged with resnet50?
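One way to confirm which parameters are absent is to compare the names the symbol expects against the names stored in the checkpoint. The following is only a minimal sketch: the helper name `find_missing_params` is hypothetical, and it assumes the only graph input is named `data`. The comparison itself is plain Python so it can be tried without MXNet installed.

```python
def find_missing_params(required_names, arg_params, aux_params, input_names=("data",)):
    """Return the names the symbol needs that the checkpoint does not provide.

    required_names: argument and auxiliary-state names the symbol expects
    arg_params / aux_params: dicts keyed by parameter name
    input_names: graph inputs fed at runtime (assumed to be just "data")
    """
    available = set(arg_params) | set(aux_params)
    return sorted(
        name for name in required_names
        if name not in available and name not in input_names
    )


# Possible use with a loaded checkpoint (prefix taken from the post above):
#   import mxnet as mx
#   sym, arg_params, aux_params = mx.model.load_checkpoint('deploy_ssd_vgg16_reduced_512', 0)
#   required = sym.list_arguments() + sym.list_auxiliary_states()
#   print(find_missing_params(required, arg_params, aux_params))
```

An empty result would suggest the param file matches the symbol; any listed names are the ones set_params would complain about.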

sad · October 18, 2018, 12:29am

hey,

So, looking at the deploy script, it seems to get the network symbols from https://github.com/apache/incubator-mxnet/blob/master/example/ssd/symbol/symbol_factory.py so there might be a bug in the config definitions for resnet. I haven’t been able to pinpoint what exactly, though.

ifeherva · October 18, 2018, 11:57pm

Thanks for the reply. Turns out it was a SageMaker bug producing wrong model files.

olivcruche · July 19, 2019, 10:46pm

I’m facing an error on the same topic:

import mxnet as mx
ctx = mx.cpu()

sym, arg_params, aux_params = mx.model.load_checkpoint('model_algo_1', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1,3,500,500))],
         label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_extra=True)

returns

RuntimeError: simple_bind error. Arguments:
data: (1, 3, 500, 500)
Error in operator multibox_target: [22:44:44] src/operator/contrib/./multibox_target-inl.h:225: Check failed: lshape.ndim() == 3 (0 vs. 3) Label should be [batch, num_labels, label_width] tensor
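The error names a multibox_target operator, which expects a label input. That may mean the loaded symbol is still the training graph, i.e. it has not been through the deploy.py conversion mentioned at the top of the thread. This is a guess from the message, not a confirmed diagnosis. A rough way to check is to scan the symbol JSON for such an operator. The sketch below assumes the usual MXNet symbol JSON layout (a top-level "nodes" list whose entries carry "op" and "name" fields); the helper name `nodes_with_op` is hypothetical.

```python
import json


def nodes_with_op(symbol_json, fragment):
    """Return names of graph nodes whose operator name contains `fragment`.

    symbol_json: contents of a *-symbol.json file as a string
    fragment: case-insensitive substring to look for, e.g. "multiboxtarget"
    """
    graph = json.loads(symbol_json)
    fragment = fragment.lower()
    return [
        node["name"]
        for node in graph.get("nodes", [])
        if fragment in node.get("op", "").lower()
    ]


# Possible use (load_checkpoint('model_algo_1', 0) reads model_algo_1-symbol.json):
#   with open('model_algo_1-symbol.json') as f:
#       print(nodes_with_op(f.read(), 'multiboxtarget'))
```

If the list is non-empty, the checkpoint likely still contains the training-only target layer and would need converting before it can be bound without labels.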
