Example SSD cannot load model if trained with resnet50

I trained an SSD model via SageMaker, which AFAIK uses the https://github.com/apache/incubator-mxnet/tree/master/example/ssd code.

After training, the model is expected to be converted to a “deployable” state by running the deploy.py script, which removes the loss symbols. Afterwards, I load the model with the following code:

import mxnet as mx
ctx = mx.cpu()  # assumption: ctx was not defined in the original snippet

sym, arg_params, aux_params = mx.model.load_checkpoint('deploy_ssd_vgg16_reduced_512', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 512, 512))],
         label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_extra=True)

This works fine as long as the model is trained with a VGG feature extractor. However, SageMaker (and hence the example code) also allows training with resnet50. That produces a model that can be converted with deploy.py, but the resulting model can no longer be loaded with the above code. The error I am getting is:

RuntimeError: _plus12_cls_pred_conv_bias is not presented

And indeed, the BN params and a few others are missing from the param file. Maybe the deploy script is bugged with resnet50?
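
One way to confirm exactly which parameters are missing is to compare the names the symbol expects against the names stored in the params file. The sketch below is only an illustration: `find_missing_params` and `report_for_checkpoint` are hypothetical helper names, and it assumes the only network input is called `data`.

```python
def find_missing_params(expected_args, expected_aux, arg_params, aux_params,
                        input_names=("data",)):
    """Return (missing_args, missing_aux) as sorted lists of parameter names.

    expected_args / expected_aux: names the symbol declares.
    arg_params / aux_params: dicts loaded from the .params file.
    input_names: inputs fed at bind time, so not expected in the file
                 (assumption: just "data").
    """
    inputs = set(input_names)
    missing_args = sorted(
        name for name in expected_args
        if name not in inputs and name not in arg_params
    )
    missing_aux = sorted(name for name in expected_aux if name not in aux_params)
    return missing_args, missing_aux


def report_for_checkpoint(prefix, epoch):
    """Hypothetical convenience wrapper around an MXNet checkpoint."""
    import mxnet as mx  # imported lazily so the helper above has no dependencies

    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    return find_missing_params(sym.list_arguments(),
                               sym.list_auxiliary_states(),
                               arg_params, aux_params)
```

Calling `report_for_checkpoint` with the prefix and epoch of the converted model should list the convolution biases and BN statistics that the symbol needs but the file does not contain, which makes it easier to tell whether the symbol or the params file is the wrong one.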

hey,

so looking at the deploy script, it seems it gets the network symbols from https://github.com/apache/incubator-mxnet/blob/master/example/ssd/symbol/symbol_factory.py so there might be a bug in the config definitions for resnet. Haven’t been able to pinpoint what exactly, though.

Thanks for the reply. Turns out it was a SageMaker bug producing wrong model files.

I’m facing an error on the same topic:

import mxnet as mx
ctx = mx.cpu()

sym, arg_params, aux_params = mx.model.load_checkpoint('model_algo_1', 0)
mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
mod.bind(for_training=False, data_shapes=[('data', (1, 3, 500, 500))],
         label_shapes=mod._label_shapes)
mod.set_params(arg_params, aux_params, allow_extra=True)

returns

RuntimeError: simple_bind error. Arguments:
data: (1, 3, 500, 500)
Error in operator multibox_target: [22:44:44] src/operator/contrib/./multibox_target-inl.h:225: Check failed: lshape.ndim() == 3 (0 vs. 3) Label should be [batch, num_labels, label_width] tensor
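
The `multibox_target` operator named in this message builds training targets from labels, which suggests the checkpoint still holds the training symbol rather than the output of deploy.py. A quick way to check without binding anything is to scan the symbol JSON for that operator. This is a rough sketch, assuming the usual `<prefix>-symbol.json` layout with a top-level `nodes` list whose entries carry `op` and `name`; `find_training_ops` is a hypothetical helper, and matching the operator by substring is my assumption.

```python
import json


def find_training_ops(symbol_json_text, needle="multiboxtarget"):
    """Return (op, name) pairs for nodes whose operator looks training-only.

    The match ignores case and underscores, so an op spelled like
    "_contrib_MultiBoxTarget" is caught by the default needle.
    """
    graph = json.loads(symbol_json_text)
    hits = []
    for node in graph.get("nodes", []):
        op = node.get("op", "")
        if needle in op.lower().replace("_", ""):
            hits.append((op, node.get("name", "")))
    return hits


# Example usage (file name assumed from the prefix in the post above):
# with open("model_algo_1-symbol.json") as f:
#     print(find_training_ops(f.read()))
```

If the list is non-empty, the model probably still needs to go through deploy.py before it can be bound with `label_names=None`.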