You can check your network architecture like this:
________________________________________________________________________
Layer (type)              Output Shape    Param #    Previous Layer
========================================================================
data(null)                                0
________________________________________________________________________
conv1_1(Convolution)                      64         data
________________________________________________________________________
relu1_1(Activation)                       0          conv1_1
________________________________________________________________________
....
________________________________________________________________________
fc7(FullyConnected)                       4096       drop6
________________________________________________________________________
relu7(Activation)                         0          fc7
________________________________________________________________________
drop7(Dropout)                            0          relu7
________________________________________________________________________
fc8(FullyConnected)                       1000       drop7
________________________________________________________________________
prob(SoftmaxOutput)                       0          fc8
========================================================================
Total params: 13416
________________________________________________________________________
You want to use relu7 for your fine-tuning layer.
(new_sym, new_args) = get_fine_tune_model(sym, arg_params, num_classes, layer_name='relu7')
2018-05-28 18:57:46,908 Epoch Batch  Speed: 441.92 samples/sec accuracy=0.884375
2018-05-28 18:57:48,405 Epoch Batch  Speed: 427.60 samples/sec accuracy=0.934375
2018-05-28 18:57:48,406 Epoch Train-accuracy=0.934375
2018-05-28 18:57:48,406 Epoch Time cost=35.405
2018-05-28 18:58:00,677 Epoch Validation-accuracy=0.725211
Yes! Thanks Tom!
Using relu7 with sgd and lr=0.001 works.
But I am curious: why does this happen?
I’ve added a dense layer (mapping to num_classes). Let the former network be F.
I believe there are 3 dense layers: fc6, fc7 and fc8.
relu just adds a non-linear transformation, so relu7_output = relu(F(x).fc7_output),
and then final_output = softmax(W*relu7_output + b).
Using just fc7 instead, final_output = softmax(W*fc7_output + b).
What I believe is that, using just fc7, the final 2 layers can be seen as one BIG W, so it should work just as well.
But the fact is that using relu7 works.
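The "BIG W" claim above can be checked numerically: stacking two affine layers without a non-linearity in between always collapses into a single affine layer. A toy-sized numpy sketch (the dimensions are arbitrary, not VGG's real 4096):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Pretend fc7 is an affine map, and the new head is another affine map.
W7, b7 = rng.standard_normal((6, 8)), rng.standard_normal(6)  # toy "fc7"
W, b = rng.standard_normal((5, 6)), rng.standard_normal(5)    # new head

# Two stacked affine layers...
two_layers = W @ (W7 @ x + b7) + b
# ...equal one "BIG" affine layer with W_big = W @ W7, b_big = W @ b7 + b:
one_layer = (W @ W7) @ x + (W @ b7 + b)
print(np.allclose(two_layers, one_layer))  # True
```

With a ReLU inserted between the two maps, this collapse no longer holds, which is exactly what makes the relu7 cut point different from the fc7 cut point.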
Previously you were using flatten0, which comes before any dense layers.
Then you have fc6 and fc7, which are fully connected layers with 4096 hidden units each.
fc8 is the classification layer, with 1000 units (the number of classes in ImageNet-1k).
If you use flatten0 and then add your own classification layer, you are not using fc6 and fc7. For example, just looking at fc7, it has ~16M parameters. So you are throwing away a lot of pre-trained information if you use flatten0 rather than relu7.
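The ~16M figure for fc7 follows from simple arithmetic, since fc7 maps 4096 units to 4096 units:

```python
# fc7 is a 4096 -> 4096 fully connected layer:
fc7_params = 4096 * 4096 + 4096  # weight matrix + bias vector
print(fc7_params)  # 16781312, i.e. ~16.8M parameters
```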
Thanks for the explanation.
So now my understanding is that the right way to fine-tune is to replace only the last classification dense layer (fc8).
If using flatten0, fc6, or fc7, the newly added layer doesn’t have enough capacity to transform the information from the earlier layers into a small number of classes (compared to 1k). Tested with relu6: bad performance.
So I believe the VGG team must have tested how many dense layers and how many neurons work best (at least for ImageNet). (I haven’t seen any explanation of how and why the last 3 dense layers were chosen in the original paper: https://arxiv.org/pdf/1409.1556.pdf. Maybe I just missed it.)