I have a trained network (in fp32) and I want to optimize it for mobile devices.
I tried int8 quantization using the ncnn platform, and it brought about a 30% speedup. But that is not very impressive, and it has to keep the first and last layers in floating point, otherwise the accuracy drop is massive. (By the way, should full-int8 computation really hurt accuracy that badly? My model is around 20 MB, and I've seen similarly sized models give good full-int8 results.) So I'm now considering pruning my model.
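For context, this is my rough understanding of the symmetric per-tensor int8 quantization involved, as a minimal NumPy sketch. The scale computation here is my own assumption for illustration, not ncnn's actual implementation:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map the max absolute value to 127.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an fp32 approximation of the original tensor.
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# The round-trip error is bounded by about half a quantization step (s/2).
print(np.abs(w - w_hat).max())
```

My guess is that the first and last layers are sensitive because their value ranges make this single per-tensor scale too coarse, but I'd welcome a correction on that.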
I've gone through this forum, and the only information I found about pruning is that there are several pruned ResNet models using the Gluon API. However, my model uses the Module API and is not exactly a ResNet structure. Is there any guide for pruning a trained model (using the Module API) and then retraining it?
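To make the question concrete, this is the kind of unstructured magnitude pruning I have in mind, as a framework-agnostic NumPy sketch (the function name and interface are mine, not from any MXNet API):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    weights:  fp32 array from a trained layer
    sparsity: fraction of weights to remove, e.g. 0.5 for 50%

    Returns the pruned weights and a binary mask. During retraining,
    the mask would be re-applied after each update so that pruned
    weights stay at zero.
    """
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

w = np.random.randn(256, 256).astype(np.float32)
pruned, mask = magnitude_prune(w, 0.5)
print(f"sparsity achieved: {1 - mask.mean():.2f}")
```

What I don't know is how to apply a mask like this to Module API parameters and keep it enforced during retraining, which is the part I'm hoping someone can point me to.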
Moreover, what is the recommended order of the two steps: quantize first or prune first?
Any help or discussion is appreciated. Thanks!