Gluon with ray and dask.distributed problem

feevos · June 18, 2018, 1:15pm

Dear all,

I am running some trivial tests in order to use ray for embarrassingly parallel distributed runs with mxnet. I am getting a weird error that I cannot understand; perhaps you could help. I’ve set up ray on an HPC GPU cluster at work (CSIRO - huuuuuuge thanks to the IT people!!!). For my work I just need to wrap an mxnet (gluon model) run in a function and assign it to a separate node (no communication with other nodes for now, no exchange of data between GPUs). The goal is to use this for Bayesian hyperparameter optimization.

When I am using 2 or more nodes with this trivial example, everything is working:

import os
import sys
import ray
import time

# mxnet gpu examples 
import mxnet as mx
from mxnet import nd
import numpy as np


@ray.remote(num_gpus=4)
def f():
    # ray sets CUDA_VISIBLE_DEVICES for each task, and the visible GPUs are
    # re-indexed from 0, so only the number of entries matters here.
    gpus = [int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(',')]
    tctx = [mx.gpu(i) for i in range(len(gpus))]
    a = nd.random.uniform(shape=[3, 4, 16, 16], ctx=tctx[0])
    return a.asnumpy()

if __name__ == '__main__':
    # The redis address of the ray head node is passed on the command line.
    ray.init(redis_address=sys.argv[1])
    result1 = ray.get(f.remote())
    result2 = ray.get(f.remote())

    print(result1, result2)

However, when I try to use any gluon object that derives from HybridBlock, for example:

from mxnet import gluon

@ray.remote(num_gpus=4)
def f(x):
    loss = gluon.loss.L2Loss()
    return x

I get the following error. It looks mxnet/gluon specific, but perhaps it has something to do with ray. By the way, the error persists even if I use a single node, and I do not have this problem when I run my code outside of ray:

b082:6379
Remote function __main__.f failed with:
Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'
  You can inspect errors by running
      ray.error_info()
  If this driver is hanging, start a new one with
      ray.init(redis_address="10.141.1.148:6379")
  
Traceback (most recent call last):
  File "test_ray.py", line 75, in <module>
    x1 = ray.get(feature1_id)
  File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(4d221ae8ea6544ba46008c611f40f7ae24fb1f08). It was created by remote function __main__.f which failed with:
Remote function __main__.f failed with:
Traceback (most recent call last):
  File "test_ray.py", line 30, in f
    loss = gluon.loss.L2Loss()
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 129, in __init__
    super(L2Loss, self).__init__(weight, batch_axis, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/loss.py", line 77, in __init__
    super(Loss, self).__init__(**kwargs)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 693, in __init__
    super(HybridBlock, self).__init__(prefix=prefix, params=params)
  File "/home/dia021/Software/mxnet/gluon/block.py", line 172, in __init__
    self._prefix, self._params = _BlockScope.create(prefix, params, self._alias())
  File "/home/dia021/Software/mxnet/gluon/block.py", line 53, in create
    prefix = _name.NameManager._current.value.get(None, hint) + '_'
AttributeError: '_thread._local' object has no attribute 'value'

Any pointers to where I may look for a solution? Thank you very much for your time.

Kind regards,
Foivos

edit: I found a similar error message reported in gluon-cv issue #156.

edit2: I tried a similar example with pytorch instead of gluon and there are no problems, so this is a gluon-specific thing. I will report it on github; however, if anyone knows of a hack-around to make it work, please let me know. Thank you very much.

edit3: I’ve opened issue #11331. The same problem appears with dask.distributed (instead of ray) as well.

ThomasDelteil · June 19, 2018, 3:20am

Hey Foivos,

There is this (soon to be published) distributed training tutorial: https://github.com/indhub/mxnet/blob/e5b89cf9d7c35ac749ed14b54c0faa6dfffa15ef/example/distributed_training/README.md

Would that solve your issue?

Reading the docs on Ray, I realize it is quite different from what you are looking for. You want to be able to schedule some MXNet runs from Python on different machines?

edit: this seems related to the default prefix name. Can you try with: gluon.loss.L2Loss(prefix="test") ?

feevos · June 19, 2018, 3:35am

@ThomasDelteil you are amazing. Yes, the prefix="test" solved the problem. Thank you for the tutorial, I am currently trying to do hyperparam optimization, so I don’t need distributed kv store etc. I just want to find optimum LR, cycling LR params, batch size etc.

Thank you SO much for your help!!!

All the best,
Foivos

ThomasDelteil · June 19, 2018, 3:41am

Hey @feevos, happy it solved this issue, but let me know if it actually works all the way when running a proper network end-to-end. I have a feeling the issue is deeper, as it looks like Gluon cannot get hold of the thread it is currently being executed on, and I suspect it could lead to other issues elsewhere, not simply in the parameter naming code.

feevos · June 19, 2018, 3:42am

Thanks, will do so. I’ll re-open the issue on github and let you guys close it when you feel it is appropriate.

Again, many (many) thanks!!
Foivos

feevos · June 19, 2018, 4:02am

Hi @ThomasDelteil, unfortunately the problem persists. When I use:


@ray.remote(num_gpus=4)
def f(x):
    mynet = gluon.nn.HybridSequential(prefix = "test")
    with mynet.name_scope():
        mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test")

    #loss = gluon.loss.L2Loss(prefix="test")
    return x

I get a very similar error:

Traceback (most recent call last):
  File "test_ray.py", line 75, in <module>
    x1 = ray.get(feature1_id)
  File "/home/dia021/Software/anaconda3/lib/python3.6/site-packages/ray/worker.py", line 2321, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(ba429ca1ad4ca769e8be11d3fb770876998fc9c1). It was created by remote function __main__.f which failed with:

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 26, in f
    mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test2")
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
    in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
    wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
    sym = op(symbol.var('data', shape=data_shape), **kwargs)
  File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
    attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'

Remote function __main__.f failed with:

Traceback (most recent call last):
  File "test_ray.py", line 26, in f
    mynet.add(gluon.nn.Conv2D(32,kernel_size=3),prefix="test2")
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 319, in __init__
    in_channels, activation, use_bias, weight_initializer, bias_initializer, **kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 115, in __init__
    wshapes = _infer_weight_shape(op_name, dshape, self._kwargs)
  File "/home/dia021/Software/mxnet/gluon/nn/conv_layers.py", line 37, in _infer_weight_shape
    sym = op(symbol.var('data', shape=data_shape), **kwargs)
  File "/home/dia021/Software/mxnet/symbol/symbol.py", line 2454, in var
    attr = AttrScope._current.value.get(attr)
AttributeError: '_thread._local' object has no attribute 'value'

Many thanks!

ThomasDelteil · June 19, 2018, 4:54am

I will issue a PR to fix symbol.py:2454, though I am not sure of the full implications of the change.
For now you can add this just before line 2454 of symbol.py (the `attr = AttrScope._current.value.get(attr)` line):

    if not hasattr(AttrScope._current, "value"):
        AttrScope._current.value = AttrScope()

And it should work. I’m confident there’ll be a cleaner solution very soon.

import mxnet as mx
import ray

# Assumes ray.init(...) has already been called in this session.
@ray.remote
def f(x):
    net = mx.gluon.nn.Dense(2, prefix="test")
    net.initialize()
    y = net(mx.nd.array(x))
    return y.asnumpy()

ray.get(f.remote([1,2,3]))

array([[-0.0196689 ,  0.01582889],
       [-0.03933779,  0.03165777],
       [-0.05900669,  0.04748666]], dtype=float32)
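
The `hasattr` guard above generalizes to a small pattern: create the per-thread default lazily on first access instead of once at import time. A rough sketch with a stand-in class follows. The `Scope` class and the `current_scope` helper are hypothetical names, not MXNet's API, and this is not necessarily how the PR handles it:

```python
import threading


class Scope:
    """Stand-in for a scope object such as AttrScope (hypothetical)."""

    _current = threading.local()

    def __init__(self, **attrs):
        self.attrs = attrs

    def get(self, attr=None):
        # Merge this scope's attributes with any passed in by the caller.
        merged = dict(self.attrs)
        merged.update(attr or {})
        return merged


def current_scope():
    # Create the per-thread default on first use, so threads that did not
    # import the module still get a valid scope object.
    if not hasattr(Scope._current, "value"):
        Scope._current.value = Scope()
    return Scope._current.value


seen = []

def worker():
    # Runs in a thread where no default was ever assigned.
    seen.append(current_scope().get({"ctx": "gpu"}))

t = threading.Thread(target=worker)
t.start()
t.join()
print(seen)
```

Routing every read through one accessor like this keeps the guard in a single place, rather than repeating it at each call site that touches `_current.value`.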

edit: see PR here: https://github.com/apache/incubator-mxnet/pull/11332

feevos · June 19, 2018, 6:00am

Thank you @ThomasDelteil , this solves the problem (at least in the first tests). I will try to run large models and get back here. Again, many thanks!!!

edit: I just tried a fairly complicated network and it passed, all good, thanks!
