SSD Finetuning with Resnet50


#1

I am trying to finetune a model using resnet50 512. I have around 850 training images, and when I run train.py, the validation mAP starts low and climbs to around 0.25-0.30 by roughly epoch 70. Then it seems to plateau there indefinitely.

I am using the official Apache incubator repo, but I had to make some code changes to get it to work.
Specifically, I modified this block of code in train_net.py to match this version, in order to remove layers that do not seem to match the pretrained model.
I used this command to start training:

python train.py --network resnet50 --train-path data/train.rec --val-path data/test.rec --class-names text_block --num-class 1 --data-shape 512 --lr 0.0001 --finetune 1 --end-epoch 1000 --gpus 0 --val-list test

I stopped the training at epoch 457. Here is the full log:
https://s3.amazonaws.com/read-to-me-dataset/train.log

Sample output from my logs during training:

INFO:root:Epoch[216] Validation-text_block=0.277103
INFO:root:Epoch[216] Validation-mAP=0.277103
INFO:root:Epoch[217] Batch [20] Speed: 92.03 samples/sec CrossEntropy=0.573019 SmoothL1=0.505079
INFO:root:Epoch[217] Train-CrossEntropy=0.568360
INFO:root:Epoch[217] Train-SmoothL1=0.480534
INFO:root:Epoch[217] Time cost=8.733
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0218.params"
INFO:root:Epoch[217] Validation-text_block=0.266455
INFO:root:Epoch[217] Validation-mAP=0.266455
INFO:root:Epoch[218] Batch [20] Speed: 97.72 samples/sec CrossEntropy=0.570998 SmoothL1=0.501688
INFO:root:Epoch[218] Train-CrossEntropy=0.566719
INFO:root:Epoch[218] Train-SmoothL1=0.494521
INFO:root:Epoch[218] Time cost=7.926
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0219.params"
INFO:root:Epoch[218] Validation-text_block=0.237516
INFO:root:Epoch[218] Validation-mAP=0.237516
INFO:root:Epoch[219] Batch [20] Speed: 95.20 samples/sec CrossEntropy=0.570789 SmoothL1=0.487176
INFO:root:Epoch[219] Train-CrossEntropy=0.570646
INFO:root:Epoch[219] Train-SmoothL1=0.475513
INFO:root:Epoch[219] Time cost=8.406
INFO:root:Saved checkpoint to "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/model/ssd_resnet50_512-0220.params"
INFO:root:Epoch[219] Validation-text_block=0.259088
INFO:root:Epoch[219] Validation-mAP=0.259088


#2

Hi Alex,

I have never worked with this implementation of SSD, but most probably you have confused two mutually exclusive parameters.

In your command you are using the finetune parameter, which, according to the help doc, means "finetune from epoch n, rename the model before doing this". I guess what you really want is to load a pretrained network and then fine-tune it. To do that you need to use the pretrained parameter, which loads the base network in pretrained mode. Take a look at this line to see what it does: https://github.com/apache/incubator-mxnet/blob/master/example/ssd/train/train_net.py#L222

So, I would recommend changing the parameter and seeing if it helps (you probably want to do it without your code modifications, just to limit the number of possible sources of error).
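Conceptually, initializing from a pretrained SSD model while changing the number of classes means keeping only the saved weights whose shapes still match the new network and re-initializing the rest (in MXNet this is roughly what `init_params` with `allow_missing=True` enables, once mismatched entries are filtered out). A minimal sketch of that filtering idea in plain Python; the parameter names and the dict-of-shapes stand-in for MXNet's `arg_params` are illustrative, not the repo's actual API:

```python
# Sketch: keep only pretrained weights whose shapes match the target network.
# Plain tuples stand in for NDArray shapes; names here are illustrative only.

def filter_pretrained(pretrained, target_shapes):
    """Return (kept, skipped): params that fit the new network, and those that don't."""
    kept, skipped = {}, []
    for name, (shape, values) in pretrained.items():
        if target_shapes.get(name) == shape:
            kept[name] = values
        else:
            # e.g. class-prediction convs whose channel count changed with num-class
            skipped.append(name)
    return kept, skipped

# Hypothetical VOC-style pretrained head: 21 classes x 4 anchors = 84 channels.
pretrained = {
    "conv0_weight": ((64, 3, 7, 7), "..."),
    "cls_pred_conv_weight": ((84, 128, 3, 3), "..."),
}
# Hypothetical new 1-class head: (1 class + 1 background) x 4 anchors = 8 channels.
target_shapes = {
    "conv0_weight": (64, 3, 7, 7),
    "cls_pred_conv_weight": (8, 128, 3, 3),
}

kept, skipped = filter_pretrained(pretrained, target_shapes)
print(sorted(kept))  # ['conv0_weight']
print(skipped)       # ['cls_pred_conv_weight']
```

The skipped layers would then be freshly initialized, which is why fine-tuning with a different class count normally requires either this kind of filtering or passing something like `allow_missing` when loading.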


#3

Hey, thanks for the reply. I have tried this, and unfortunately it does not seem to work. I also found this GitHub issue from a while back which explains the use of the pretrained flag vs. the finetune flag.

This is the error I get when I use that flag:

python train.py --network resnet50 --train-path data/train.rec --val-path data/test.rec --class-names text_block --num-class 1 --data-shape 512 --pretrained model/ssd_resnet50_512 --end-epoch 1000 --gpus 0 --val-list test
[22:28:18] src/io/iter_image_det_recordio.cc:281: ImageDetRecordIOParser: data/train.rec, use 11 threads for decoding…
[22:28:18] src/io/iter_image_det_recordio.cc:334: ImageDetRecordIOParser: data/train.rec, label padding width: 350
[22:28:19] src/io/iter_image_det_recordio.cc:281: ImageDetRecordIOParser: data/test.rec, use 11 threads for decoding…
[22:28:19] src/io/iter_image_det_recordio.cc:334: ImageDetRecordIOParser: data/test.rec, label padding width: 350
INFO:root:Start training with (gpu(0)) from pretrained model model/ssd_resnet50_512
[22:28:22] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while… (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
File "train.py", line 156, in
tensorboard=args.tensorboard)
File "/home/aschu/development/apache-incubator/mxnet/incubator-mxnet/example/ssd/train/train_net.py", line 367, in train_net
monitor=monitor)
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/base_module.py", line 464, in fit
allow_missing=allow_missing, force_init=force_init)
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/module.py", line 308, in init_params
_impl(desc, arr, arg_params)
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/module/module.py", line 296, in _impl
cache_arr.copyto(arr)
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 1876, in copyto
return _internal._copyto(self, out=other)
File "", line 25, in _copyto
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/_ctypes/ndarray.py", line 92, in _imperative_invoke
ctypes.byref(out_stypes)))
File "/home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/base.py", line 146, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [22:28:29] src/operator/nn/./…/tensor/…/elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node at 0-th output: expected [84,128,3,3], got [8,128,3,3]

Stack trace returned 10 entries:
[bt] (0) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f41963afe78]
[bt] (1) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f41963b0288]
[bt] (2) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x5106d6) [0x7f41966166d6]
[bt] (3) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x51192b) [0x7f419661792b]
[bt] (4) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24bf0ea) [0x7f41985c50ea]
[bt] (5) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x24c1c19) [0x7f41985c7c19]
[bt] (6) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2403a7b) [0x7f4198509a7b]
[bt] (7) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/site-packages/mxnet/libmxnet.so(MXImperativeInvokeEx+0x63) [0x7f4198509fe3]
[bt] (8) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c) [0x7f41d98b7e20]
[bt] (9) /home/aschu/development/apache-incubator/mxnet/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb) [0x7f41d98b788b]

I received a similar error with the finetune flag before I modified train_net.py.
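Looking at the mismatch, expected [84,128,3,3] vs got [8,128,3,3] seems consistent with a class-count change in SSD's class-prediction conv, whose output channels are (num_classes + 1 background) * anchors_per_position. A quick sanity check of that arithmetic; the anchor count of 4 for this layer is my assumption, not something the log states:

```python
# Sanity check: SSD class-prediction conv output channels are
# (num_classes + 1 background) * anchors_per_position.
def cls_pred_channels(num_classes, num_anchors):
    return (num_classes + 1) * num_anchors

# Assuming 4 anchors at this feature-map scale:
print(cls_pred_channels(20, 4))  # VOC-pretrained head (20 classes) -> 84
print(cls_pred_channels(1, 4))   # my 1-class (text_block) head -> 8
```

So the saved detector head was trained for 20 classes, and it cannot be copied into a 1-class network without skipping or re-initializing those prediction layers.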