Dear all,
I am running some trivial tests to see whether I can use ray for embarrassingly parallel distributed runs with mxnet, and I am getting a weird error that I cannot understand; perhaps you could help. I've set up ray on an HPC GPU cluster at work (CSIRO - huuuuuuge thanks to the IT people!!!). For my work I just need to wrap an mxnet (gluon) model run in a function and assign it to a separate node (no communication between nodes for now, no GPU exchange of data). The goal is to use this for Bayesian hyperparameter optimization.
When I use 2 or more nodes with this trivial example, everything works:
import os
import sys
import ray
import time

# mxnet gpu examples
import mxnet as mx
from mxnet import nd
import numpy as np

@ray.remote(num_gpus=4)
def f():
    gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]  # In case of multiple GPUs, comment out 2nd option.
    tctx = [mx.gpu(i) for i in range(len(gpus))]
    a = nd.random.uniform(shape=[3, 4, 16, 16], ctx=tctx[0])
    return a.asnumpy()

if __name__ == '__main__':
    ray.init(redis_address=sys.argv[1])
    result1 = ray.get(f.remote())
    result2 = ray.get(f.remote())
    print(result1, result2)
However, when I try to use any gluon object that derives from HybridBlock, for example:

from mxnet import gluon

@ray.remote(num_gpus=4)
def f(x):
    loss = gluon.loss.L2Loss()
    return x
I get the following error. It looks mxnet/gluon-specific, but perhaps it has something to do with ray. By the way, the error persists even when I use a single node, and I do not have this problem when I run the same code outside of ray:
b082:6379
Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'

You can inspect errors by running
    ray.error_info()
If this driver is hanging, start a new one with
    ray.init(redis_address="10.141.1.148:6379")
Traceback (most recent call last):
  File "test_ray.py", line 75, in <module>
    x1 = ray.get(feature1_id)
  File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(4d221ae8ea6544ba46008c611f40f7ae24fb1f08). It was created by remote function __main__.f which failed with:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'
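For what it's worth, the AttributeError itself can be reproduced without mxnet or ray. My reading of the last traceback frame (an assumption on my part, not verified against the mxnet source) is that gluon keeps its current NameManager in a threading.local whose value attribute is only set in the thread that ran the module-level initialization, so a worker executing in a different thread sees a bare _thread._local. A minimal stdlib-only sketch of that mechanism:

```python
import threading

# Stand-in for the pattern in gluon/block.py: a module-level
# threading.local whose "value" attribute is set only in the
# thread that executes this top-level code (the "main" thread).
_current = threading.local()
_current.value = "main-thread NameManager"

result = {}

def worker():
    # In a freshly started thread the threading.local has no
    # "value" attribute, which is exactly the AttributeError
    # shown in the traceback above.
    try:
        result["value"] = _current.value
    except AttributeError as exc:
        result["error"] = str(exc)

t = threading.Thread(target=worker)
t.start()
t.join()

print(result)  # only the worker thread hits the AttributeError
```

Running this prints the same "'_thread._local' object has no attribute 'value'" message, suggesting the crash is about which thread executes the remote function rather than about ray itself.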
Any pointers to where I might look for a solution? Thank you very much for your time.
Kind regards,
Foivos
edit: I found a similar error message reported in gluon-cv issue #156.
edit2: I tried a similar example with pytorch instead of gluon and there were no problems, so this is a gluon-specific thing. I will report it on github; however, if anyone knows of a hack-around to make it work, please let me know. Thank you very much.
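In case it helps anyone else looking for a hack-around: since the failure is thread-local state that was never initialized in the worker thread, one pattern is to (re)initialize that state at the top of the remote function before touching gluon. The sketch below shows the idea in plain Python with a stand-in NameManager class; doing the equivalent against mxnet's real mxnet.name.NameManager._current is an untested assumption on my part, not something I have confirmed works:

```python
import threading

class NameManager:
    # Stand-in for gluon's name manager: the current instance is
    # tracked per-thread in a class-level threading.local.
    _current = threading.local()

    def get(self, prefix, hint):
        # Simplified: just echo the hint back as the name.
        return hint

def ensure_name_manager():
    # Lazily create the thread-local "value" slot if this thread has
    # never seen one -- the missing step that causes the crash.
    if not hasattr(NameManager._current, "value"):
        NameManager._current.value = NameManager()
    return NameManager._current.value

results = []

def worker():
    # Calling ensure_name_manager() first makes the lookup safe even
    # though this thread never ran the module-level initialization.
    manager = ensure_name_manager()
    results.append(manager.get(None, "l2loss") + "_")

t = threading.Thread(target=worker)
t.start()
t.join()
print(results)  # ['l2loss_']
```

The same guard could be called at the top of the remote function f() before constructing any gluon blocks, though a proper fix probably belongs in gluon itself.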
edit3: I've opened issue #11331. The same problem appears with dask.distributed (instead of ray) as well.