Deploy a SageMaker-trained model locally?


#1

Hello,

I have trained a model on SageMaker using the Object Detection algorithm: https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html

This has given me 3 files:
hyperparameters.json
model_algo_1-0000.params
model_algo_1-symbol.json

I can deploy an inference endpoint for this on SageMaker without any issues, but I am not able to deploy the same model locally.

I am following this blog post: https://aws.amazon.com/blogs/machine-learning/build-a-real-time-object-classification-system-with-apache-mxnet-on-raspberry-pi/

However, any value that I provide for the --label-name argument throws an error. How would one go about deploying a SageMaker-trained Object Detection model locally? The final goal is to be able to deploy the inference endpoint to a Raspberry Pi.

Any help is appreciated. Feel free to let me know if you need more details.

Thanks,
Vinay Nadig


#2

Can you elaborate on “any value that I provide for --label-name argument throws out an error”? A code snippet from what you tried and the error message you got will help.


#3

Hi Indu,

Here is the script I am using (the same one as in the blog) - https://pastebin.com/NC7Qg6dy
Here is my model_algo_1-symbol.json file - https://pastebin.com/mPaR1zYx
Here is the command that I am trying to run:

python load_model.py --label-name 'cls_prob' --img 'test.jpg' --prefix 'model_algo_1' --synset 'synset.txt' 

synset.txt contains a single line with just one word, ‘car’, which is the only object the model is trained to detect.

The error I am getting is this:

[02:49:20] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v1.2.1. Attempting to upgrade...
[02:49:20] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py:55: UserWarning: You created Module with Module(..., label_names=['cls_prob']) but input with name 'cls_prob' is not found in symbol.list_arguments(). Did you mean one of:
	relu4_3_scale
	data
	label
  warnings.warn(msg)
/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py:67: UserWarning: Data provided by label_shapes don't match names specified by label_names ([] vs. ['cls_prob'])
  warnings.warn(msg)
Traceback (most recent call last):
  File "load_model.py", line 105, in <module>
    mod = ImagenetModel(args.synset, args.prefix, label_names=[args.label_name], params_url=args.params_url, symbol_url=args.symbol_url, synset_url=args.synset_url)
  File "load_model.py", line 45, in __init__
    self.mod.bind(for_training=False, data_shapes= input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/module.py", line 429, in bind
    state_names=self._state_names)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 279, in __init__
    self.bind_exec(data_shapes, label_shapes, shared_group)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 375, in bind_exec
    shared_group))
  File "/usr/local/lib/python2.7/dist-packages/mxnet/module/executor_group.py", line 662, in _bind_ith_exec
    shared_buffer=shared_data_arrays, **input_shapes)
  File "/usr/local/lib/python2.7/dist-packages/mxnet/symbol/symbol.py", line 1528, in simple_bind
    raise RuntimeError(error_msg)
RuntimeError: simple_bind error. Arguments:
data: (1, 3, 224, 224)
Error in operator multi_feat_5_conv_3x3_conv: [02:49:20] src/operator/nn/convolution.cc:193: Check failed: dilated_ksize_y <= AddPad(dshape[2], param_.pad[0]) (3 vs. 2) kernel size exceed input

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1befb4) [0x7fc7321cefb4]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x1bf391) [0x7fc7321cf391]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x4a1d86) [0x7fc7324b1d86]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29b3fba) [0x7fc7349c3fba]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29b6964) [0x7fc7349c6964]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29a2efa) [0x7fc7349b2efa]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x29a3a34) [0x7fc7349b3a34]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(MXExecutorSimpleBind+0x2368) [0x7fc73490ee48]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fc741fe1e40]
[bt] (9) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x2eb) [0x7fc741fe18ab]

Let me know if you would like access to all 3 files (hyperparameters.json, model_algo_1-0000.params, model_algo_1-symbol.json) and I will share them with you.


#4

@VishaalKapoor @thomelane


#5

Hi @vinaynadig,

It looks like the symbol ‘upgrade’ is getting things in a muddle here! According to the log, the model was saved with MXNet v1.2.1, and you’re now loading it back with a newer version of MXNet. The upgrade is required because there was a change in the model serialisation format, but it seems to change the model architecture too (or at least the naming of symbols).

You have a symbol called cls_prob in model_algo_1-symbol.json, but after the upgrade you have a symbol called label. I can think of a few things to try here:

  1. You could try changing --label-name 'label'
  2. You could downgrade to v1.2.1 if you want to keep the load_model.py unchanged.
  3. Change load_model.py to use load and load_params, instead of load_checkpoint.
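
For option 3, here is a rough sketch of what loading the two files separately could look like. It is untested against this model and I can't promise it avoids the upgrade issue. It assumes MXNet 1.x's Module API, that the symbol does not need a label input at inference time, and that the input is a single 300x300 RGB image (this has to match the image_shape used in training). The names split_params and load_for_inference are made up for illustration.

```python
def split_params(saved):
    """Split a dict keyed 'arg:<name>' / 'aux:<name>' into (arg_params, aux_params).

    This is the key layout MXNet uses in checkpoint .params files.
    Keys with any other prefix are ignored.
    """
    arg_params, aux_params = {}, {}
    for key, value in saved.items():
        kind, _, name = key.partition(":")
        if kind == "arg":
            arg_params[name] = value
        elif kind == "aux":
            aux_params[name] = value
    return arg_params, aux_params


def load_for_inference(prefix, epoch=0, data_shape=(1, 3, 300, 300)):
    """Hypothetical helper: build an inference-only Module from <prefix>-symbol.json
    and <prefix>-<epoch>.params. data_shape is an assumption, see above."""
    # Imported lazily so split_params can be used without MXNet installed.
    import mxnet as mx

    sym = mx.sym.load("%s-symbol.json" % prefix)
    saved = mx.nd.load("%s-%04d.params" % (prefix, epoch))
    arg_params, aux_params = split_params(saved)

    # label_names=None: no label input is bound, since we only run forward passes.
    mod = mx.mod.Module(symbol=sym, data_names=["data"],
                        label_names=None, context=mx.cpu())
    mod.bind(for_training=False, data_shapes=[("data", data_shape)])
    mod.set_params(arg_params, aux_params, allow_missing=True)
    return mod
```

With the files from this thread, the call would be load_for_inference('model_algo_1'), which reads model_algo_1-symbol.json and model_algo_1-0000.params.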

#6

Hi Thom,

Thanks for the pointers! I was able to get it to work with the following changes:

  1. Strip the model of training layers as defined here - https://github.com/zhreshold/mxnet-ssd#convert-model-to-deploy-mode
  2. Downgrade mxnet to v1.2.1.
  3. When running deploy.py from the mxnet-ssd repo, make sure its arguments (e.g. ‘nms’) match the hyperparameters that were used to train the model.
  4. In load_model.py, set the input_shapes value to (3, 300, 300). SageMaker recommends either 300 or 512 for the image_shape hyperparameter, and I used 300 during training.

With these changes, I am able to use the model generated by SageMaker locally.
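
In case it helps anyone else, the input shape in step 4 can be derived from the training hyperparameters instead of being hard-coded. This is only a sketch: it assumes hyperparameters.json has an "image_shape" entry holding a single number (possibly stored as a string), and the function names are hypothetical. The leading 1 is the batch dimension, as seen in the "data: (1, 3, 224, 224)" line of the traceback.

```python
import json


def load_hyperparameters(path="hyperparameters.json"):
    """Read the hyperparameters file that came with the trained model."""
    with open(path) as f:
        return json.load(f)


def data_shape_from_hyperparameters(hyperparameters, channels=3):
    """Return (batch, channels, height, width) for a single square image.

    Assumes 'image_shape' is one integer such as 300 or 512, which may be
    stored as a string, so it is converted with int().
    """
    size = int(hyperparameters["image_shape"])
    return (1, channels, size, size)


# Example with the value used in this thread (image_shape = 300).
shape = data_shape_from_hyperparameters({"image_shape": "300"})
print(shape)
```

The resulting tuple is what would replace the 224x224 shape that load_model.py binds by default.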


#7

Glad you managed to get it working, and thanks for the follow-up post!