Updating mxnet from 1.0.0, networks give different outputs

I am working in a production environment where we have some networks implemented in mxnet 1.0.0. I am working on updating our systems and trying to move to the latest mxnet (1.4.x as of now), but when we upgrade, our networks produce different outputs.
We are using symbols saved to json files and arg/aux_params stored in .params files. These were all produced by mxnet 1.0.0 or earlier.

When using the latest mxnet (or 1.4.x) we are getting different outputs for the same inputs with our saved models. I have been trying to use git bisect, or to upgrade slowly through the versions, to figure out where this breaking change occurred, but there are issues with your git history and/or some strange (compiler??) incompatibilities which prevent getting a clean checkout/build for nearly all of the intermediate versions…

Does anyone have any idea what could be causing this?

EDIT: FYI these are VERY different outputs. I’m rounding to about the fifth decimal place, or higher, so these aren’t just numerical differences.

And we are using modules, not Gluon (part of the point of the upgrade is to move towards Gluon).

Hi,

To better understand the nature of the issue, I have a few questions:

  • What API are you using for inference?
  • During inference are you doing any preprocessing steps on your data before passing it into the model?
  • What kind of inputs are you passing into the model? Images?
  • What kind of network is it? Is it a popular network like resnet or a custom designed network?
  • Can you inspect the model weights when loaded in 1.4.x vs 1.0.0? Are the weights the same?
  • What is the magnitude of difference (or percentage difference) between the model predictions across both versions?

Trying to figure out where the change happened by going through the revision history of the repo would be an arduous task. But it would be insightful to know whether this issue exists in previous stable MXNet release versions, e.g. 1.2.1, 1.3.x, etc.
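
For the weights question above, one possible way to check is to write a small fingerprint of every parameter under each MXNet version and then compare the two results. This is only a rough sketch: the function names, the `prefix`/`epoch` values and the output file names are placeholders I made up, and it assumes the model was saved in the usual checkpoint layout (`prefix-symbol.json` plus `prefix-XXXX.params`) so that `mx.model.load_checkpoint` can read it.

```python
import hashlib
import json


def fingerprint_checkpoint(prefix, epoch):
    """Return {name: [shape, dtype, sha256]} for every arg/aux parameter.

    Run this once under each MXNet version. `prefix` and `epoch` are
    placeholders for your own checkpoint files.
    """
    import mxnet as mx  # imported here so the comparison below needs no MXNet

    _, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    fingerprints = {}
    for group, params in (("arg", arg_params), ("aux", aux_params)):
        for name, value in params.items():
            array = value.asnumpy()
            digest = hashlib.sha256(array.tobytes()).hexdigest()
            fingerprints[group + ":" + name] = [
                list(array.shape),
                str(array.dtype),
                digest,
            ]
    return fingerprints


def diff_fingerprints(old, new):
    """Return the sorted names that are missing on one side or that differ."""
    names = set(old) | set(new)
    return sorted(name for name in names if old.get(name) != new.get(name))


# Hypothetical usage, once per environment:
#   with open("weights_1.0.0.json", "w") as f:
#       json.dump(fingerprint_checkpoint("my_model", 0), f)
# Then, in either environment:
#   old = json.load(open("weights_1.0.0.json"))
#   new = json.load(open("weights_1.4.x.json"))
#   print(diff_fingerprints(old, new))
```

An empty list would suggest the parameters load bit-for-bit identically in both versions, which would point towards operator behaviour rather than loading. The same digest idea could also be applied to the preprocessed input batch to confirm it is identical in both environments.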

To better understand the nature of the issue, I have a few questions:

  • What API are you using for inference?
    Python, module
  • During inference are you doing any preprocessing steps on your data before passing it into the model?
    Yes, but the input to the network is exactly the same, this is part of a test suite.
  • What kind of inputs are you passing into the model? Images?
    Images
  • What kind of network is it? Is it a popular network like resnet or a custom designed network?
    It’s custom, I can give you a break down of what’s in it
  • Can you inspect the model weights when loaded in 1.4.x vs 1.0.0? Are the weights the same?
    I will look into this
  • What is the magnitude of difference (or percentage difference) between the model predictions across both versions?
    There’s no consistent magnitude, but it’s between 1e0 and 1e-5

Trying to figure out where the change happened by going through the revision history of the repo would be an arduous task. But it would be insightful to know whether this issue exists in previous stable MXNet release versions, e.g. 1.2.1, 1.3.x, etc.

I have done this, and I cannot get any of them to compile. There are issues with git commits that are missing from the submodules, as well as strange compiler issues that seem to affect all the releases between 1.0 and 1.4…
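
To put firmer numbers on the magnitude question, something like the following could summarise the gap between the outputs of the two versions. It is a minimal sketch in plain Python: `summarize_difference` is a made-up helper, it expects the outputs already flattened to lists of floats, and the `atol=1e-5` default is just an assumption taken from the rounding mentioned earlier in the thread.

```python
def summarize_difference(old_outputs, new_outputs, atol=1e-5):
    """Summarise element-wise differences between two flat float sequences."""
    if len(old_outputs) != len(new_outputs):
        raise ValueError("outputs have different lengths")

    abs_diffs = [abs(a - b) for a, b in zip(old_outputs, new_outputs)]

    # Relative difference against the larger magnitude; pairs where both
    # values are exactly zero are skipped to avoid dividing by zero.
    rel_diffs = []
    for diff, a, b in zip(abs_diffs, old_outputs, new_outputs):
        scale = max(abs(a), abs(b))
        if scale > 0:
            rel_diffs.append(diff / scale)

    return {
        "count": len(abs_diffs),
        "max_abs": max(abs_diffs, default=0.0),
        "max_rel": max(rel_diffs, default=0.0),
        "num_over_atol": sum(1 for d in abs_diffs if d > atol),
    }
```

With the Module API, each output NDArray could be flattened with `.asnumpy().ravel().tolist()` before being passed in. If most elements are over the tolerance, that points to something structural; if only a handful are, it may narrow down which layer or channel diverges.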

Edit: making responses clear

Thanks for responding.

Are you installing from source? Otherwise, it should be straightforward to run pip install mxnet==1.2.1, for example, to get v1.2.1.

Also, sorry to press on this, but are the pixel values in the input ndarray exactly identical across the two different versions?