MXNet Forum

Error Running Examples related to CNN


#1

I recently set up a new computer and I can’t run my code on it. Every time I compute the loss and run backpropagation, I get this error:

```
KernelRestarter: restarting kernel (4/5), keep random ports
kernel c795ac26-b3b1-4c3c-94fe-34cd67a934a4 restarted
Traceback (most recent call last):
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "", line 2, in initialize
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 467, in initialize
    self.init_sockets()
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 239, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/home/tianweiy/anaconda3/envs/py36/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 181, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use
```
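The `ZMQError: Address already in use` appears to come from the restarted kernel trying to bind the same ports again ("keep random ports"), so it is probably a side effect of the crash rather than its cause. A minimal sketch, using only the standard library, for checking whether a given TCP port can still be bound; the helper name `port_is_free` and the default host are assumptions for illustration, not part of Jupyter or ZeroMQ:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another socket already holds the port.
            return False
        return True
```

Calling it with one of the ports listed in the kernel's connection file would show whether a leftover process is still holding that port.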
When I finish running the code, I get another error:
```
Segmentation fault: 11

Stack trace returned 4 entries:
[bt] (0) /home/tianweiy/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x382eea) [0x7f4c8261eeea]
[bt] (1) /home/tianweiy/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31a3d76) [0x7f4c8543fd76]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20) [0x7f4cee389f20]
[bt] (3) [0x55c700a80e20]
```

My code is just a basic CNN example. Debugging in a Jupyter notebook, I found that every time the program runs
```
    # AutoGrad
    with ag.record():
        output = [net(X) for X in data]
        loss = [loss_fn(yhat, y) for yhat, y in zip(output, label)]

    # Backpropagation
    for l in loss:
        l.backward()
```
the kernel fails and I get the error messages above.

I also tried the MXNet MNIST example:

```
label = gluon.utils.split_and_load(batch.label[0], ctx_list=ctx, batch_axis=0)
outputs = []
with ag.record():
    for x, y in zip(data, label):
        print(1)
        z = net(x)  # this line causes the error
```

System setup:
Ubuntu 18.04
CUDA 9.2, 9.1, and 10.0 installed (I select CUDA 9.2 through an env file that sets the paths to that specific version)
cuDNN v7.3.1 for Linux
I have tested my CUDA and cuDNN installation using the NVIDIA samples.
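With several CUDA versions installed side by side, it may be worth confirming which one the Python process actually sees. A rough sketch, assuming the env file works by setting `PATH` and `LD_LIBRARY_PATH` (the post does not say which variables it sets); `cuda_entries` is a hypothetical helper name:

```python
import os


def cuda_entries(env=None):
    """List the PATH and LD_LIBRARY_PATH entries that mention 'cuda'."""
    env = os.environ if env is None else env
    found = {}
    for var in ("PATH", "LD_LIBRARY_PATH"):
        # Split the variable into its entries and keep the CUDA-looking ones.
        parts = env.get(var, "").split(os.pathsep)
        found[var] = [p for p in parts if "cuda" in p.lower()]
    return found


if __name__ == "__main__":
    for var, parts in cuda_entries().items():
        print(var, parts or "(no CUDA entries)")
```

Running this inside the notebook, not just in a terminal, matters here, because the Jupyter kernel may not inherit the same environment as the shell where the env file was sourced.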

GPU: RTX 2080
CPU: AMD Ryzen 2600
RAM: 16 GB

I don’t think memory or CPU usage is the cause: memory usage is only 50%, and I set the batch size and num_worker to very small values, but the kernel still fails.
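Because a segmentation fault takes the whole Jupyter kernel down with it, one way to narrow this down is to run the failing step in a separate Python process and inspect how that process ended. A minimal sketch using only the standard library; `run_isolated` and `describe_exit` are hypothetical helper names, and the code string passed in is whatever snippet reproduces the crash:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # On POSIX, a negative return code means the child died from a signal.
        try:
            return "killed by " + signal.Signals(-returncode).name
        except ValueError:
            return "killed by signal %d" % -returncode
    return "exited with code %d" % returncode


def run_isolated(code, timeout=600):
    """Run a Python snippet in a fresh interpreter and capture its output."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr
```

For example, passing a one-liner such as `"import mxnet as mx; print(mx.nd.ones((2, 2), ctx=mx.gpu(0)).asnumpy())"` would show whether even a trivial GPU operation crashes, independently of the network and the notebook. A result of "killed by SIGSEGV" for such a snippet would point at the MXNet/CUDA installation rather than at the training code.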


#2

Have you tried running one of the test scripts on the gluon website?


#3

I haven’t tried the train_mnist example from the MXNet root directory, but my first example just copies code from this script. I think the problem probably has something to do with driver support for the RTX 20 series, Ubuntu 18.04, or my CUDA and cuDNN setup. Reinstalling with Ubuntu 16.04 and CUDA/cuDNN might solve the problem. But since I have several urgent projects and PyTorch works fine for now, I will probably wait until MXNet officially supports CUDA 10.