SSD Finetuning with ResNet-50

I am trying to fine-tune an SSD model using resnet50 at a 512 data shape. I have around 850 training images, and when I run train.py, the validation mAP starts low, climbs to around 0.25-0.30 by roughly epoch 70, and then seems to stay there indefinitely.

I am using the SSD example from the official apache/incubator-mxnet repo, but I had to make some code changes to get it to work.
Specifically, I modified this block of code in train_net.py to match this version, in order to remove layers whose shapes do not seem to match the pretrained model.
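The gist of my change: after the checkpoint that --finetune points at is loaded, drop every parameter whose shape no longer matches the rebuilt one-class network. A simplified sketch of the idea (not my exact diff; I am assuming the example's get_symbol_train helper from symbol_factory, and the prefix/epoch values below are just examples):

import logging
import mxnet as mx
# symbol_factory ships in the example/ssd directory of incubator-mxnet
from symbol.symbol_factory import get_symbol_train

# Checkpoint that --finetune loads; prefix and epoch here are examples.
prefix, epoch = 'model/ssd_resnet50_512', 1
_, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)

# Rebuild the 1-class training symbol and bind it once on CPU, purely
# to learn the parameter shapes the new network expects.
net = get_symbol_train('resnet50', 512, num_classes=1)
exe = net.simple_bind(mx.cpu(), data=(1, 3, 512, 512),
                      label=(1, 1, 5), grad_req='null')

for name, expected in exe.arg_dict.items():
    if name in arg_params and arg_params[name].shape != expected.shape:
        del arg_params[name]  # e.g. class-prediction convs sized for VOC
        logging.info('Removed mismatched parameter %s', name)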
I used this command to start the training:

python train.py --network resnet50 --train-path data/train.rec --val-path data/test.rec --class-names text_block --num-class 1 --data-shape 512 --lr 0.0001 --finetune 1 --end-epoch 1000 --gpus 0 --val-list test

I stopped the training at epoch 457. Here is the full log:
https://s3.amazonaws.com/read-to-me-dataset/train.log

Sample output from my logs during training:

INFO:root:Epoch[216] Validation-text_block=0.277103
INFO:root:Epoch[216] Validation-mAP=0.277103
INFO:root:Epoch[217] Batch [20] Speed: 92.03 samples/sec CrossEntropy=0.573019 SmoothL1=0.505079
INFO:root:Epoch[217] Train-CrossEntropy=0.568360
INFO:root:Epoch[217] Train-SmoothL1=0.480534
INFO:root:Epoch[217] Time cost=8.733
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0218.params"
INFO:root:Epoch[217] Validation-text_block=0.266455
INFO:root:Epoch[217] Validation-mAP=0.266455
INFO:root:Epoch[218] Batch [20] Speed: 97.72 samples/sec CrossEntropy=0.570998 SmoothL1=0.501688
INFO:root:Epoch[218] Train-CrossEntropy=0.566719
INFO:root:Epoch[218] Train-SmoothL1=0.494521
INFO:root:Epoch[218] Time cost=7.926
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0219.params"
INFO:root:Epoch[218] Validation-text_block=0.237516
INFO:root:Epoch[218] Validation-mAP=0.237516
INFO:root:Epoch[219] Batch [20] Speed: 95.20 samples/sec CrossEntropy=0.570789 SmoothL1=0.487176
INFO:root:Epoch[219] Train-CrossEntropy=0.570646
INFO:root:Epoch[219] Train-SmoothL1=0.475513
INFO:root:Epoch[219] Time cost=8.406
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0220.params"
INFO:root:Epoch[219] Validation-text_block=0.259088
INFO:root:Epoch[219] Validation-mAP=0.259088

Hi Alex,

I have never worked with this implementation of SSD, but most probably you have confused two mutually exclusive parameters.

In your command you are using the finetune parameter, which, according to the help doc, means "finetune from epoch n, rename the model before doing this". I guess what you really want is to start from a pretrained network and then fine-tune it. To do that you need to use the pretrained parameter, which loads the base network in pretrained mode. Take a look at this line to see what it does: https://github.com/apache/incubator-mxnet/blob/master/example/ssd/train/train_net.py#L222
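To make the distinction concrete, here is roughly what the pretrained path does, reduced to a minimal sketch (not the literal code from train_net.py; net, train_iter and val_iter are placeholders for the SSD training symbol and the data iterators the script builds):

import mxnet as mx

# --pretrained points at a base-network checkpoint (e.g. an ImageNet
# ResNet-50), not at a full SSD model; epoch 0 is the usual convention
# for downloaded pretrained weights.
_, arg_params, aux_params = mx.model.load_checkpoint('model/resnet-50', 0)

# net, train_iter and val_iter stand in for the script's local variables.
mod = mx.mod.Module(symbol=net, context=mx.gpu(0),
                    data_names=['data'], label_names=['label'])
# allow_missing=True lets fit() randomly initialize the SSD-specific
# layers, which are absent from the base checkpoint.
mod.fit(train_iter, eval_data=val_iter,
        arg_params=arg_params, aux_params=aux_params,
        allow_missing=True, num_epoch=240)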

So, I would recommend changing the parameter and seeing whether that helps (you probably want to do it without your code modifications, just to limit the number of possible sources of error).

Hey, thanks for the reply. I have tried this and unfortunately it does not seem to work. I also found this GitHub issue from a while back which explains the use of the pretrained flag vs. the finetune flag.

This is the error I get when I use that flag:

python train.py --network resnet50 --train-path data/train.rec --val-path data/test.rec --class-names text_block --num-class 1 --data-shape 512 --pretrained model/ssd_resnet50_512 --end-epoch 1000 --gpus 0 --val-list test
[22:28:18] src/io/iter_image_det_recordio.cc:281: ImageDetRecordIOParser: data/train.rec, use 11 threads for decoding...
[22:28:18] src/io/iter_image_det_recordio.cc:334: ImageDetRecordIOParser: data/train.rec, label padding width: 350
[22:28:19] src/io/iter_image_det_recordio.cc:281: ImageDetRecordIOParser: data/test.rec, use 11 threads for decoding...
[22:28:19] src/io/iter_image_det_recordio.cc:334: ImageDetRecordIOParser: data/test.rec, label padding width: 350
INFO:root:Start training with (gpu(0)) from pretrained model model/ssd_resnet50_512
[22:28:22] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
  File "train.py", line 156, in <module>
    tensorboard=args.tensorboard)
  File "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/train/train_net.py", line 367, in train_net
    monitor=monitor)
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/base_module.py", line 464, in fit
    allow_missing=allow_missing, force_init=force_init)
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/module.py", line 308, in init_params
    _impl(desc, arr, arg_params)
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/module.py", line 296, in _impl
    cache_arr.copyto(arr)
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 1876, in copyto
    return _internal._copyto(self, out=other)
  File "<string>", line 25, in _copyto
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
    ctypes.byref(out_stypes)))
  File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/base.py", line 146, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:28:29] src/operator/nn/./../tensor/../elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [84,128,3,3], got [8,128,3,3]

Stack trace returned 10 entries:
[bt] (0) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f41963afe78]
[bt] (1) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f41963b0288]
[bt] (2) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x5106d6) [0x7f41966166d6]
[bt] (3) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x51192b) [0x7f419661792b]
[bt] (4) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24bf0ea) [0x7f41985c50ea]
[bt] (5) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24c1c19) [0x7f41985c7c19]
[bt] (6) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2403a7b) [0x7f4198509a7b]
[bt] (7) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x63) [0x7f4198509fe3]
[bt] (8) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f41d98b7e20]
[bt] (9) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f41d98b788b]

I received a similar error with the finetune flag before I modified train_net.py. The shapes in the error line up with the class counts: the checkpoint's class-prediction convolutions appear to be sized for the 20 VOC classes (4 anchors × (20 classes + 1 background) = 84 output channels), while with --num-class 1 the rebuilt network expects only 4 × (1 + 1) = 8.
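For what it is worth, the mismatch is visible by inspecting the checkpoint directly. A small diagnostic sketch (I am assuming the example's cls_pred naming convention for the class-prediction layers, and epoch 0 for the downloaded checkpoint):

import mxnet as mx

# Load the checkpoint passed to --pretrained and list the class-prediction
# convolution weights; their output channels encode anchors * (classes + 1).
_, arg_params, _ = mx.model.load_checkpoint('model/ssd_resnet50_512', 0)
for name, array in sorted(arg_params.items()):
    if 'cls_pred' in name and name.endswith('_weight'):
        print(name, array.shape)  # e.g. (84, 128, 3, 3): 4 * (20 + 1)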