I have some questions regarding quantization.
I studied this blog post: https://medium.com/apache-mxnet/model-quantization-for-production-level-neural-network-inference-f54462ebba05 and the related code.
I was under the impression that `int8` quantization was possible on CPU; however, I am now finding out that only `uint8` is possible on CPU. Is that correct? Is there a plan to implement `int8` with MKLDNN?
When I perform `int8` quantization and do inference on GPU, I get results very similar to the `fp32` version of the model. When I perform `uint8` quantization, my results are completely out of whack, even when I exclude every symbol except the convolutions. This is when I use `calibration='none'`. How do I make sure that my model outputs results similar to the `fp32` one when using `uint8`?
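For context on why an uncalibrated range can wreck `uint8` results, here is a minimal sketch of generic affine `uint8` quantization in plain NumPy (this is an illustration of the general technique, not MXNet's actual implementation): the wider the assumed value range, the larger the quantization step, and the larger the round-trip error.

```python
import numpy as np

def quantize_uint8(x, xmin, xmax):
    # Affine (asymmetric) quantization of x to uint8 over [xmin, xmax].
    scale = (xmax - xmin) / 255.0
    q = np.clip(np.round((x - xmin) / scale), 0, 255).astype(np.uint8)
    return q, scale, xmin

def dequantize_uint8(q, scale, xmin):
    # Map the uint8 codes back to float32.
    return q.astype(np.float32) * scale + xmin

np.random.seed(0)
x = np.random.randn(1000).astype(np.float32)

# Calibrated range: use the data's own min/max -> small step size.
q, s, z = quantize_uint8(x, x.min(), x.max())
err_calibrated = np.abs(dequantize_uint8(q, s, z) - x).max()

# Uncalibrated range: an overly wide fixed range inflates the step size.
q2, s2, z2 = quantize_uint8(x, -50.0, 50.0)
err_uncalibrated = np.abs(dequantize_uint8(q2, s2, z2) - x).max()

assert err_uncalibrated > err_calibrated
```

This is why choosing sensible min/max ranges per layer (what calibration does) matters so much; without it, the quantized outputs can drift far from the `fp32` ones.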
What is the role of `MXNET_SUBGRAPH_BACKEND=MKLDNN`? What does it do, and how does it relate to quantization? (Edit: found the answer thanks to @anirudh2290: https://mxnet.incubator.apache.org/versions/master/tutorials/mkldnn/operator_list.html. I would suggest linking to this page from quantization.py to help people find it.)
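For reference, one way to set this variable from Python before binding the model looks like the following; the description in the comment reflects my understanding from the MKL-DNN tutorial linked above, so treat it as an assumption rather than a definitive statement:

```python
import os

# Ask MXNet to run the MKL-DNN subgraph passes (e.g. operator fusion such
# as conv + bn + relu) when the symbol is bound. This needs to be set
# before the model is bound for it to take effect.
os.environ['MXNET_SUBGRAPH_BACKEND'] = 'MKLDNN'
```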
What is the role of `sym_q.get_backend_symbol('MKLDNN_QUANTIZE')`? What does it do? I find it a bit confusing because my understanding is that `MKLDNN_QUANTIZE` actually does operator fusion, not quantization.
Am I correct to assume that even if some symbols appear to be split, they are actually fused? See https://cwiki.apache.org/confluence/display/MXNET/MXNet+Graph+Optimization+and+Quantization+based+on+subgraph+and+MKL-DNN. For example, here I ran the quantization step and applied `MKLDNN_QUANTIZE`, but it looks like this conv + batchnorm pair has not been fused. Is that normal?
- Just a suggestion: it would be great if quantization were more Gluon-friendly! I'll modify `SymbolBlock` so that it can at least load a quantized model from scratch. I also think the calibration step could be implemented differently; for example, it might be simpler to pass a "Calibrator" instance to the quantization function that would hold all the calibration parameters and take care of running the inference passes. This Calibrator could then be subclassed to support different types of models with different data types, iterators, etc.
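To make the suggestion concrete, here is a rough sketch of what such a Calibrator could look like. All names and the interface here are hypothetical; nothing in MXNet currently exposes this API:

```python
class Calibrator:
    """Hypothetical helper: holds calibration parameters and collects
    per-layer output ranges by running inference passes."""

    def __init__(self, num_batches=10):
        self.num_batches = num_batches
        self.ranges = {}  # layer name -> (min, max) observed so far

    def observe(self, name, outputs):
        # Merge this batch's min/max into the running range for the layer.
        lo, hi = min(outputs), max(outputs)
        if name in self.ranges:
            prev_lo, prev_hi = self.ranges[name]
            lo, hi = min(lo, prev_lo), max(hi, prev_hi)
        self.ranges[name] = (lo, hi)

    def collect(self, run_batch):
        """run_batch() runs one forward pass and yields
        (layer_name, flat_outputs) pairs; subclasses could override
        this to plug in different data iterators or data types."""
        for _ in range(self.num_batches):
            for name, outputs in run_batch():
                self.observe(name, outputs)
        return self.ranges


# Toy usage with a fake forward pass standing in for real inference.
def fake_batch():
    yield ('conv0', [-1.0, 0.5, 2.0])
    yield ('conv1', [0.0, 3.0])

calib = Calibrator(num_batches=2)
ranges = calib.collect(fake_batch)
```

The quantization function would then only need to consume `calib.ranges`, instead of owning the data iteration logic itself.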
Thanks for your work on quantization!
Symbol of the quantized resnet18: