Using MXNet on a GPU cluster with SLURM

I am having trouble using MXNet on a SLURM cluster.

So this is the cluster setup, and these are the modules I have loaded:

$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq         up   infinite      0    n/a 
NV100q       up   infinite      2  alloc node[07-08]
PV100q       up   infinite      1  alloc node09
K20q         up   infinite      3   idle node[01-03]
K80q         up   infinite      2  alloc node[05-06]
RTXq         up   infinite      2  alloc node[10-11]
RTXq         up   infinite      2   idle node[13-14]
RTX1q        up   infinite      1   idle node12
PV1002q      up   infinite      2  alloc node[16-17]
GV1002q      up   infinite      1  alloc node15
PP1003q      up   infinite      1  alloc node04
DGXq         up   infinite      2  alloc node[18-19]

$ module list
Currently Loaded Modulefiles:
 1) slurm/17.11.5         3) cuda90/fft/9.0.176       5) cuda90/nsight/9.0.176    
 2) cuda90/blas/9.0.176   4) cuda90/toolkit/9.0.176   6) cuda90/profiler/9.0.176  

I used pip to install mxnet-cu90.
If I start the Python interpreter in a terminal, I get:

>>> import mxnet as mx
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kaivalya/env/lib/python3.5/site-packages/mxnet/__init__.py", line 24, in <module>
    from .context import Context, current_context, cpu, gpu, cpu_pinned
  File "/home/kaivalya/env/lib/python3.5/site-packages/mxnet/context.py", line 24, in <module>
    from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
  File "/home/kaivalya/env/lib/python3.5/site-packages/mxnet/base.py", line 213, in <module>
    _LIB = _load_lib()
  File "/home/kaivalya/env/lib/python3.5/site-packages/mxnet/base.py", line 204, in _load_lib
    lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
  File "/usr/lib/python3.5/ctypes/__init__.py", line 347, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory

So I learned that I need to submit it as a script using sbatch, which I did as follows.

testjob.sh is

#!/bin/sh

#SBATCH -o gpu-job-%j.output
#SBATCH -p K20q
#SBATCH --gres=gpu:1
#SBATCH -n 1 

# module load cuda90/toolkit
module load cuda90/blas/9.0.176
module load cuda90/nsight/9.0.176
module load cuda90/profiler/9.0.176
module load cuda90/toolkit/9.0.176
module load cuda90/fft/9.0.176


/cm/shared/apps/cuda90/sdk/9.0.176/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS

python test.py

test.py is

import mxnet as mx

ctx = mx.gpu()

print("Working")

Then running the sbatch command

sbatch testjob.sh 

gives:

/cm/local/apps/slurm/var/spool/job99162/slurm_script: line 15:  1829 Illegal instruction     (core dumped) python test.py
GPU Device 0: "Tesla K20Xm" with compute capability 3.5

simpleCUBLAS test running..
simpleCUBLAS test passed.

Can someone help me?

Hi,

I have never used SLURM, but from the error:

OSError: libcuda.so.1: cannot open shared object file: No such file or directory

it looks like the libcuda.so.1 file is somewhere the dynamic library loader can’t find it.

The output of:

ldd env/lib/python3.5/site-packages/mxnet/libmxnet.so | grep libcuda.so.1
	libcuda.so.1 => /usr/lib/x86_64-linux-gnu/libcuda.so.1 (0x00007f1a6359e000)

should give you the path where the libcuda.so.1 library is expected; on your system the file is apparently not where the loader looks for it.

You can add the folder where libcuda.so.1 is installed on your system to the LD_LIBRARY_PATH environment variable.
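
For example, a minimal sketch (the /usr/lib64/nvidia directory below is only a placeholder; substitute whatever directory the commands report on a GPU node):

ldconfig -p | grep libcuda                                  # check whether the loader already knows about libcuda
find /usr /cm -name 'libcuda.so.1' 2>/dev/null              # otherwise locate the driver library manually
export LD_LIBRARY_PATH=/usr/lib64/nvidia:$LD_LIBRARY_PATH   # placeholder path; use the directory found above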

hth,

Lieven


In SLURM I don’t know how to add file paths.
Available modules can be seen with

module avail

I have these modules, which can be loaded and are expected to link the relevant libraries.
I was able to import TensorFlow with sbatch, but not MXNet!

@feevos, do you have any experience with SLURM that could help @kaivu1999 ?


Hi @kaivu1999 (thanks @ThomasDelteil - just saw this).

I am also running MXNet under the SLURM cluster manager. In my case I have a dedicated environment set up locally for my account. I first install Anaconda, with all the Python environment definitions (so I do not load modules for my work, besides the CUDA drivers). With this module list:

Currently Loaded Modulefiles:
  1) SC                    3) cuda-driver/current   5) intel-fc/16.0.4.258   7) texlive/2015
  2) slurm/current         4) intel-cc/16.0.4.258   6) cuda/9.2.88

and a local Python and mxnet-cu92 installation (assuming you are in the directory where the installation should go):

pip install -t . mxnet-cu90==1.5.0b20190703

(before the mxnet installation I load the CUDA drivers, for peace of mind; I don’t know if that makes a difference), mxnet works fine on interactive nodes (or the login terminal). If you want to submit jobs, you need to make sure that all nodes can see the local mxnet installation you’ve done. This is my .bashrc file:

export PYTHONPATH=$PYTHONPATH:/scratch1/dia021/Software #CUSTOM installation
module load cuda/9.2.88 # This is my default for GPU computing.

# added by Anaconda3 installer
alias ipylab='ipython --pylab'

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/scratch1/dia021/Software/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/scratch1/dia021/Software/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/scratch1/dia021/Software/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/scratch1/dia021/Software/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<

This is an example of a single-node job submission (consuming 4 GPUs):

#!/bin/bash -l

#SBATCH --job-name="RunA"
#SBATCH -t 23:30:30
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
#SBATCH --mem=16gb


export  PYTHONUNBUFFERED=1

python main.py

A way to add file paths so that all nodes can see them is through the Python environment, in the main file you run:

import sys
sys.path.append(r'/YourLocal/Software/Installation/')

Also, you need to load the specific modules, as you already do, in the job.sh submit file prior to running your executable.
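
For example, a minimal sketch of such a submit script, reusing the partition and module names from your earlier post (the LD_LIBRARY_PATH line is only needed if libcuda.so.1 is not already visible to the loader, and its path is a placeholder):

#!/bin/bash -l

#SBATCH -p K20q
#SBATCH --gres=gpu:1
#SBATCH -n 1

module load cuda90/toolkit/9.0.176                          # load the CUDA modules before running the executable
export LD_LIBRARY_PATH=/usr/lib64/nvidia:$LD_LIBRARY_PATH   # placeholder; only if libcuda.so.1 is not found otherwise
python test.py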

Hope this helps.


What is the purpose of this line (the simpleCUBLAS call)? Incidentally, it is line 16 in your script, so very close to line 15 where the error is given. Maybe this is the problem?

Another way to test mxnet is to open an interactive Python environment and see if mxnet works:

sinteractive -g gpu:1

or

salloc --ntasks-per-node 1 -J interactive -t 2:00:00 srun --pty /bin/bash -l -g gpu:1

then

ipython

then

import mxnet as mx
ctx = mx.gpu()

Thanks @feevos for the detailed reply. I am trying it out and will let you know!
Can you tell me how you built mxnet from source?
I am not able to run nvcc or nvidia-smi on the shell.
Also, I don’t have the sinteractive command somehow, I don’t know why.
And,

salloc --ntasks-per-node 1 -J interactive -t 2:00:00 srun -p K20q -n 1 --gres=gpu:1

says:

 salloc: error: Job submit/allocate failed: No partition specified or system default partition

Hi @kaivu1999,

I did not, I installed via pip as described above.

It seems that sinteractive is an alias for srun commands on our local system; try this (it worked for me):

srun -N 1 -n 1 --gres=gpu:1 --pty bash -i

You can also try it without a GPU, to test the mxnet installation on the CPU:

srun -N 1 -n 1  --pty bash -i

I do not know why you are not able to run nvidia-smi; in my case it works even when the CUDA drivers are not loaded. nvcc works only if I load the corresponding CUDA module. I think you should contact your local system administrator, as it seems your environment has custom definitions.

Hi @feevos,
Thanks for the reply.

sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq         up   infinite      0    n/a 
NV100q       up   infinite      2  alloc node[07-08]
PV100q       up   infinite      1  alloc node09
K20q         up   infinite      3  alloc node[01-03]
K80q         up   infinite      2  alloc node[05-06]
RTXq         up   infinite      4  alloc node[10-12,14]
RTXq         up   infinite      1   idle node13
PV1002q      up   infinite      2  alloc node[16-17]
GV1002q      up   infinite      1  alloc node15
PP1003q      up   infinite      1  alloc node04
DGXq         up   infinite      2  alloc node[18-19]

and on running srun

srun -p RTXq -n 1 --gres=gpu:1 --pty bash -i

with or without a GPU, gives the same error:

ERROR: Unable to locate a modulefile for 'cuda90/profiler/9.0.17'

But as mentioned earlier, I have the cuda90 modules loaded:

module list
Currently Loaded Modulefiles:
 1) slurm/17.11.5         3) cuda90/nsight/9.0.176   5) cuda90/profiler/9.0.176  
 2) cuda90/blas/9.0.176   4) cuda90/fft/9.0.176      6) cuda90/toolkit/9.0.176

I don’t understand why.

Hi @kaivu1999, I am afraid I am out of my depth here. Message from my favorite sys admin (thanks Ond!):

Try this (exactly as it is)

salloc --ntasks-per-node 1 -p K20q -J interactive -t 2:00:00 --gres gpu:1 srun --pty /bin/bash -l

What system are you using? Any additional info? Cheers
Edit: try a completely clean, new shell environment (log out/log in) before attempting the above.

Thank you very much @feevos and Ond:

I don’t understand what to do!
So now, running the exact command that you mentioned, but with RTXq (since someone is using K20q now :frowning:), I am getting:

 salloc: Granted job allocation 139248
srun: Step created for job 139248
ERROR: Unable to locate a modulefile for 'cuda90/profiler/9.0.17'

I have loaded the module ‘cuda90/profiler/9.0.176’, but the ‘unable to locate’ message says ‘cuda90/profiler/9.0.17’.

Tried in a new shell as well, but no success!


OK, this looks super weird. Getting a node allocation (with GPU resources) is completely independent of the loaded modules, which is something you do AFTER the node is allocated. Therefore, it doesn’t make sense to get this error, which relates to the CUDA driver, after the allocation command was successful. On our system, I get the node allocation and then I choose which CUDA driver I want to load (with multiple options for the same hardware).
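
For reference, the allocate-then-load workflow I mean looks roughly like this (module name taken from your module list; the last line is just a quick import check):

srun -p K20q -n 1 --gres=gpu:1 --pty bash -i      # first get an interactive node with a GPU
module load cuda90/toolkit/9.0.176                # then load the CUDA module on that node
python -c "import mxnet as mx; print(mx.gpu())"   # quick check that mxnet imports and a GPU context can be created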

Can you please try again to start the interactive node and, once successful, type module avail cuda and paste the result here?

Hi, I don’t know how, but now it worked!!

kaivalya@scsegc:~$ salloc --ntasks-per-node 1 -p K80q -J interactive -t 2:00:00 --gres gpu:1 srun --pty /bin/bash -l
salloc: Granted job allocation 139552
srun: Step created for job 139552
ERROR: Unable to locate a modulefile for 'cuda90/profiler/9.0.17'
kaivalya@node05:~$ source env/bin/activate
(env) kaivalya@node05:~$ python
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.kaivalya@node05:~$ source env/bin/activate
>>> import mxnet as mx
>>> mx.cpu()
cpu(0)
>>> mx.gpu()
gpu(0)
>>> 

But the error regarding the module still shows up!
module avail cuda gives:

 module avail cuda
--------------------------------------- /cm/local/modulefiles ---------------------------------------
cuda-dcgm/1.3.3.1  cuda-test/test  

-------------------------------------- /cm/shared/modulefiles ---------------------------------------
cuda90/blas/9.0.176    cuda90/profiler/9.0.176  cuda91/fft/9.1.85       cuda91/toolkit/9.1.85  
cuda90/fft/9.0.176     cuda90/toolkit/9.0.176   cuda91/nsight/9.1.85    
cuda90/nsight/9.0.176  cuda91/blas/9.1.85       cuda91/profiler/9.1.85