Distributed Training (Permission Denied)

Hi, I am trying to run the provided distributed training example on 2 different nodes with the following internal IP addresses:

user1@111.111.111.121
user2@111.111.111.122

I created a hosts file with these IPs, and ssh works without any issue from one machine to the other. When I launch the code:

python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_device_sync

It prompts for all of the following at the same time:

user1@111.111.111.121's password: user1@111.111.111.121's password: user2@111.111.111.122's password: user2@111.111.111.122's password:

Both machines use the same admin password, but no matter how many times I try, I just get: Permission denied, please try again.

Is there any way I can get debug messages about what is really happening in the background?

For distributed training to work, it must be possible to ssh into the machines in the hosts file without being prompted for a password or any other credentials. One way to do this is to specify the SSH private key files in ~/.ssh/config. Example:

~$ cat ~/.ssh/config 
Host d1
    HostName ec2-111-111-111-121.compute-1.amazonaws.com
    port 22
    user ubuntu
    IdentityFile /home/ubuntu/test.pem
    IdentitiesOnly yes

Host d2
    HostName ec2-111-111-111-122.compute-1.amazonaws.com
    port 22
    user ubuntu
    IdentityFile /home/ubuntu/test.pem
    IdentitiesOnly yes

With the above configuration, you can just have the hostnames (d1 and d2) in the hosts file.

~$ cat hosts
d1
d2
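
Before launching, it may help to confirm that every entry in the hosts file really accepts a login without prompting. The following is only a rough sketch, not part of launch.py: check_hosts is a hypothetical helper name, and it assumes the hosts file has one host per line as above. BatchMode=yes makes ssh fail immediately instead of asking for a password, which is the situation launch.py runs into.

```shell
#!/usr/bin/env bash
# check_hosts is a hypothetical helper, not part of MXNet.
# It tries a non-interactive SSH login to every host listed in a file.
check_hosts() {
    local hosts_file="$1" failed=0 host
    while IFS= read -r host; do
        [ -z "$host" ] && continue   # skip blank lines
        # -n keeps ssh from consuming the rest of the hosts file on stdin.
        # BatchMode=yes disables password prompts, so a missing key fails fast.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done < "$hosts_file"
    return "$failed"
}

# Usage, from the directory that contains the hosts file:
#   check_hosts hosts
```

For any host reported as FAIL, running ssh by hand with the verbose flag (for example: ssh -v -o BatchMode=yes d1 true) prints which config file and identity files were tried, which should give the kind of debug output asked about above.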

Hi, did you solve the Permission denied problem? I'm running into the same issue. Could you help me, please?