Kvstore setting failed in fit()

I am trying to set the kvstore for distributed training, but it fails when kvstore is set to “dist_sync_device”. I launch the training on AWS Batch as a single multi-node job. On a single node, with kvstore set to “device”, there are no issues. Note that we use the fit interface rather than Gluon. I am not sure why the error is raised in pickle.py. Is it because fit does not support kvstore = “dist_sync_device”?

com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z kvstore.set_optimizer(self._optimizer)
24 Apr 2020 18:34:26,935 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/lib/python2.7/site-packages/mxnet/kvstore.py”, line 488, in set_optimizer
24 Apr 2020 18:34:26,935 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z optim_str = py_str(pickle.dumps(optimizer, 0))
24 Apr 2020 18:34:26,935 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 1380, in dumps
24 Apr 2020 18:34:26,935 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z Pickler(file, protocol).dump(obj)
24 Apr 2020 18:34:26,935 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.935Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 224, in dump
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self.save(obj)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 331, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self.save_reduce(obj=obj, *rv)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 425, in save_reduce
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z save(state)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 286, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z f(self, obj) # Call unbound method with explicit self
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 655, in save_dict
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self._batch_setitems(obj.iteritems())
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 669, in _batch_setitems
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z save(v)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 331, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self.save_reduce(obj=obj, *rv)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 425, in save_reduce
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z save(state)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 286, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z f(self, obj) # Call unbound method with explicit self
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 655, in save_dict
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self._batch_setitems(obj.iteritems())
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 669, in _batch_setitems
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z save(v)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 331, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self.save_reduce(obj=obj, *rv)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 425, in save_reduce
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z save(state)
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 286, in save
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z f(self, obj) # Call unbound method with explicit self
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 655, in save_dict
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z self._batch_setitems(obj.iteritems())
24 Apr 2020 18:34:26,936 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.936Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 669, in _batch_setitems
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z save(v)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 286, in save
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z f(self, obj) # Call unbound method with explicit self
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 606, in save_list
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z self._batch_appends(iter(obj))
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 621, in _batch_appends
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z save(x)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 331, in save
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z self.save_reduce(obj=obj, *rv)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 425, in save_reduce
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z save(state)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 286, in save
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z f(self, obj) # Call unbound method with explicit self
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 655, in save_dict
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z self._batch_setitems(obj.iteritems())
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 669, in _batch_setitems
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z save(v)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/pickle.py”, line 306, in save
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z rv = reduce(self.proto)
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z File “/opt/amazon/var/tmp/ihm-driver-working-directory8230582072918757545/c3339496b65d975eecf8be7385a89598/runner/python2.7/lib/python2.7/copy_reg.py”, line 70, in _reduce_ex
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z raise TypeError, “can’t pickle %s objects” % base.name
24 Apr 2020 18:34:26,937 [INFO] (DriverRunThread-0) com.amazon.ihm.driver.subprocess.output.ApplicationLogOutputWriter: Command Output: 2020-04-24T18:34:26.937Z TypeError: can’t pickle file objects
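
The traceback shows that kvstore.set_optimizer pickles the optimizer with pickle.dumps(optimizer, 0), and that the pickler walks object → dict → object → dict → object → dict → list → object → dict before hitting a file object. So the error comes from something reachable from the optimizer that holds an open file, not from the kvstore string itself. Below is a minimal sketch for locating such an attribute. It is written for Python 3 and does not import MXNet. The helper name find_unpicklable and the stand-in optimizer (including the attribute names lr_scheduler, logger, handlers and stream) are hypothetical and only mimic the nesting seen in the traceback; the real attribute path in my job may differ.

```python
import os
import pickle
from types import SimpleNamespace


def find_unpicklable(obj, path="optimizer", _seen=None):
    """Return attribute paths under `obj` whose values cannot be pickled."""
    _seen = set() if _seen is None else _seen
    if id(obj) in _seen:  # do not revisit shared objects
        return []
    _seen.add(id(obj))
    try:
        pickle.dumps(obj, 0)  # same protocol as the call in the traceback
        return []
    except Exception:
        pass  # not picklable: look for the offending child below
    if isinstance(obj, dict):
        children = [("%s[%r]" % (path, k), v) for k, v in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [("%s[%d]" % (path, i), v) for i, v in enumerate(obj)]
    elif hasattr(obj, "__dict__"):
        children = [("%s.%s" % (path, k), v) for k, v in vars(obj).items()]
    else:
        children = []
    bad = []
    for child_path, child in children:
        bad.extend(find_unpicklable(child, child_path, _seen))
    # If no child is to blame, this object itself is the culprit.
    return bad or [path]


# Hypothetical stand-in for an optimizer that indirectly holds an open file.
log_file = open(os.devnull, "w")
optimizer = SimpleNamespace(
    learning_rate=0.01,
    lr_scheduler=SimpleNamespace(
        logger=SimpleNamespace(handlers=[SimpleNamespace(stream=log_file)])
    ),
)

bad_paths = find_unpicklable(optimizer)
print(bad_paths)  # ['optimizer.lr_scheduler.logger.handlers[0].stream']
log_file.close()
```

Running the same helper on the real optimizer object before calling fit() should print the attribute that carries the file handle. With kvstore = “device” the pickling step in the traceback is apparently never reached, which would explain why the single-node run does not fail.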