I was participating in the MicroNet Challenge recently, and the host (Google) proposed a way to count FLOPs for float16 models. Their position is that, for a float16 model, the **mult operations** in a matrix-by-matrix multiplication should be counted as float16 operations, while the **add operations** should be counted as float32.
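To make sure I understand that counting rule, here is a minimal sketch (my own illustration, not the organizers' code) of how the FLOPs of a single `(m, k) x (k, n)` matmul would be split:

```python
def matmul_flop_split(m, k, n):
    """Hypothetical split of matmul FLOPs under the proposed rule."""
    # Each of the m*n output elements needs k multiplications...
    fp16_mults = m * n * k
    # ...and k - 1 additions to accumulate them, counted as fp32 per the rule.
    fp32_adds = m * n * (k - 1)
    return fp16_mults, fp32_adds
```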

From what the Nvidia and MXNet tutorials claim:

> **Nvidia:** The Volta generation of GPUs introduces Tensor Cores, which provide 8x more throughput than single-precision math pipelines. Each Tensor Core performs D = A x B + C, where A, B, C, and D are matrices. A and B are half-precision 4x4 matrices, whereas D and C can be either half or single precision 4x4 matrices.

> **MXNet:** Nvidia Tensor Cores essentially perform the computation D = A * B + C, where A and B are half-precision matrices, while C and D could be either half-precision or full precision.
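If I read those quotes correctly, the products come from fp16 operands but may be accumulated into an fp32 result. A rough numpy emulation of those semantics (just my illustration; this is not what the hardware actually executes):

```python
import numpy as np

# Emulate one Tensor Core operation D = A x B + C:
# fp16 inputs, fp16 multiplies, fp32 accumulation.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

D = np.empty((4, 4), dtype=np.float32)
for i in range(4):
    for j in range(4):
        prods = A[i, :] * B[:, j]  # elementwise multiplies stay in fp16
        D[i, j] = prods.astype(np.float32).sum() + C[i, j]  # adds in fp32
print(D.dtype)  # float32
```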

What I am wondering is: in the actual implementation, does Gluon's mixed precision perform the **addition** in half or full precision for a float16 model? Thanks!
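For context, this is the kind of setup I mean (a minimal Gluon example; `nn.Dense` and the shapes are just placeholders). The output dtype is fp16 either way, so the accumulation precision inside the kernel is not visible from the Python side:

```python
import mxnet as mx
from mxnet.gluon import nn

net = nn.Dense(10)
net.initialize()
net.cast('float16')  # cast parameters so the matmul runs on fp16 inputs

x = mx.nd.random.uniform(shape=(1, 16), dtype='float16')
y = net(x)
print(y.dtype)  # float16 -- the output dtype says nothing about whether the
                # internal accumulation happened in half or full precision
```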