Updating mxnet from 1.0.0, networks give different outputs

jmerkow · March 13, 2019, 6:31pm

I am working in a production environment where we have some networks implemented in mxnet 1.0.0. I am working on updating our systems and trying to move to the latest mxnet (1.4.x as of now), but when we upgrade, our networks produce different outputs.
We are using symbols saved to json files and arg/aux_params stored in .params files. These were all produced by mxnet 1.0.0 or earlier.

When using the latest mxnet (or 1.4.x) we are getting different outputs for the same inputs with our saved models. I have been trying to use git bisect, or to upgrade one version at a time, to figure out where this breaking change occurred, but there are issues with your git history and/or some strange (compiler?) incompatibilities that prevent a clean checkout/build for nearly all of the intermediate versions…

Does anyone have any idea what could be causing this?

EDIT: FYI, these are VERY different outputs. I’m rounding to about the fifth decimal place, or higher, so these aren’t numerical-precision differences.

And we are using modules, not Gluon (part of the point is to upgrade and move towards Gluon).

sad · March 13, 2019, 11:19pm

Hi,

To better understand the nature of the issue, I have a few questions:

What API are you using for inference?
During inference are you doing any preprocessing steps on your data before passing it into the model?
What kind of inputs are you passing into the model? Images?
What kind of network is it? Is it a popular network like resnet or a custom designed network?
Can you inspect the model weights when loaded in 1.4.x vs 1.0.0? Are the weights the same?
What is the magnitude of difference (or percentage difference) between the model predictions across both versions?

Trying to figure out where the change was introduced by going through the revision history of the repo would be an arduous task, but it would be insightful to know whether this issue exists in previous stable MXNet releases, i.e. 1.2.1, 1.3.x, etc.
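
One possible way to answer the weights question is to hash every parameter array under each MXNet version and compare the hashes. This is only a sketch: it assumes the .params file was written with named arrays (as save_checkpoint does), so that mx.nd.load returns a dict with keys such as arg:... and aux:...; the function names and the file names in the usage note are hypothetical.

```python
import hashlib
import json


def fingerprint_params(params_path):
    """Hash every array in a .params file. Run once per MXNet version."""
    import mxnet as mx  # imported lazily so the diff step below needs no MXNet

    digests = {}
    # Assumption: the file holds a dict of name -> NDArray.
    for name, array in mx.nd.load(params_path).items():
        data = array.asnumpy()
        digests[name] = {
            "shape": list(data.shape),
            "dtype": str(data.dtype),
            "sha256": hashlib.sha256(data.tobytes()).hexdigest(),
        }
    return digests


def diff_fingerprints(old, new):
    """Return sorted names that are missing on one side or differ in content."""
    return sorted(
        name for name in set(old) | set(new)
        if old.get(name) != new.get(name)
    )
```

In each environment, something like json.dump(fingerprint_params("model-0000.params"), open("weights-1.0.0.json", "w")) would save the fingerprints; loading both JSON files and passing them to diff_fingerprints then lists any parameter that does not load to identical bytes. An empty list would suggest the difference comes from the operators rather than from the stored weights.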

jmerkow · March 13, 2019, 11:39pm

> To better understand the nature of the issue, I have a few questions:

> What API are you using for inference?
Python, module
> During inference are you doing any preprocessing steps on your data before passing it into the model?
Yes, but the input to the network is exactly the same; this is part of a test suite.
> What kind of inputs are you passing into the model? Images?
Images
> What kind of network is it? Is it a popular network like resnet or a custom designed network?
It’s custom; I can give you a breakdown of what’s in it.
> Can you inspect the model weights when loaded in 1.4.x vs 1.0.0? Are the weights the same?
I will look into this.
> What is the magnitude of difference (or percentage difference) between the model predictions across both versions?
There’s no consistent magnitude, but it’s between 1e0 and 1e-5.

> Trying to figure out where the change was introduced by going through the revision history of the repo would be an arduous task, but it would be insightful to know whether this issue exists in previous stable MXNet releases, i.e. 1.2.1, 1.3.x, etc.

I have done this, but I cannot get any of them to compile. There are issues with git commits that are missing from the submodules, as well as strange compiler issues that seem to affect all the releases between 1.0 and 1.4…
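
To put a firmer number on the "between 1e0 and 1e-5" observation, the two output vectors could be reduced to a maximum absolute and a maximum relative difference. The helper below is a hypothetical, dependency-free sketch; it assumes each output has already been flattened to a plain list of floats (for example via asnumpy().ravel().tolist() on the module output).

```python
def summarize_difference(old, new, eps=1e-12):
    """Return (max_abs_diff, max_rel_diff, index_of_max_abs_diff).

    The index is -1 when the two sequences are identical.
    """
    if len(old) != len(new):
        raise ValueError("outputs must have the same length")
    max_abs, max_rel, worst = 0.0, 0.0, -1
    for i, (a, b) in enumerate(zip(old, new)):
        abs_diff = abs(a - b)
        # Relative to the larger magnitude; eps avoids division by zero.
        rel_diff = abs_diff / max(abs(a), abs(b), eps)
        if abs_diff > max_abs:
            max_abs, worst = abs_diff, i
        max_rel = max(max_rel, rel_diff)
    return max_abs, max_rel, worst
```

A large relative difference on values that are not close to zero would point away from floating-point noise and towards a behavioural change in some operator.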

sad · March 13, 2019, 11:58pm

Thanks for responding.

Are you installing from source? Otherwise, it should be straightforward to run pip install mxnet==1.2.1, for example, to get v1.2.1.
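
If the released wheels install cleanly, the search for the breaking release can be done over pip versions instead of git commits. The following is a minimal sketch of that bisection, not a tested procedure: the version list is an assumption (check PyPI for what is actually published), and is_good is a hypothetical callback that would install the given version into a fresh environment, run the existing test suite, and report whether the outputs still match the 1.0.0 baseline.

```python
# Assumed list of released versions, oldest first; verify against PyPI.
CANDIDATES = ["1.0.0", "1.1.0", "1.2.0", "1.2.1", "1.3.0", "1.3.1", "1.4.0"]


def first_bad_version(versions, is_good):
    """Binary search for the earliest version whose outputs differ.

    Assumes versions[0] is known good, versions[-1] is known bad, and that
    there is a single good-to-bad transition in between.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]
```

With seven candidates this needs at most three install-and-test rounds, which may be more practical than building each intermediate commit from source.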

Also, sorry to press on this, but are the pixel values in the input ndarray exactly identical across the two different versions?
