`MXImperativeInvokeEx` is taking a long time


Hey guys. As shown in the image, MXImperativeInvokeEx is taking a long time, and I wonder what it actually does.
I used the profiler to profile my whole program, from data loading to gradient updating (roughly as sketched below).
Also, there are blank stages between the processes; I suspect those correspond to data loading.
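In case it helps, this is roughly how I collect the trace with MXNet's built-in profiler (a minimal sketch; the filename and the exact config flags here are placeholders, not my actual setup):

import mxnet as mx
from mxnet import profiler

# Record operators, memory, and C API calls to a Chrome-tracing JSON file.
profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile.json')

profiler.set_state('run')   # start recording just before the run
# ... training loop: data loading, forward/backward, gradient update ...
mx.nd.waitall()             # make sure all asynchronous work is captured
profiler.set_state('stop')  # stop recording and flush the trace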
Sorry, I may not have described my problem concretely, since I'm new to MXNet. This program was originally written in PyTorch, and I recently rewrote it using MXNet Gluon (with HybridBlock), keeping most things the same. But the PyTorch version is 10 times faster than this MXNet version, and I'm getting nowhere finding a solution.
I'm eager to find somewhere I can chat with people about the problem in real time, so that I can give more details.

Can you send a reproducible example? Which versions of MXNet, CUDA, and cuDNN, and which OS are you running on?
One thing to keep in mind is that MXNet does some optimization at the beginning that can take some time. You can enable/disable it by setting MXNET_CUDNN_AUTOTUNE_DEFAULT (see the sketch below).
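For example (a minimal sketch; the variable takes the usual documented values, 0 = off, 1 = limited-workspace autotune, 2 = fastest in recent versions, and should be set before mxnet is imported):

import os

# Disable cuDNN convolution autotuning to avoid its warm-up cost.
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'

import mxnet as mx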

Yes. If you don’t mind, this is my repository.
project
The related part is located in mx_hico/roi_mil.
And it uses a wrapper I wrote.
wrapper
The related part is located in mx_wrapper/mx_wrapper.
Since I don't know which part causes the problem, I can't give a tiny reproducible sample; sorry about that.
I've turned MXNET_CUDNN_AUTOTUNE_DEFAULT off. I'm using MXNet 1.3.1, CUDA 8.0, and cuDNN 5.1.3 on Ubuntu 14.04 with a Titan X.
Thank you for your reply.

I tried a simple example to reproduce the problem.
For PyTorch:

import time

import torch
from torch import nn
import torchvision.models as models

# ResNet-50 replicated over two GPUs; outputs are gathered on GPU 0.
resnet = models.resnet50(pretrained=True)
resnet.cuda(0)
resnet = nn.DataParallel(resnet, [0, 1], output_device=0)

data = torch.ones(8, 3, 224, 224)
data = data.cuda(0)

# Warm-up pass, then time a single forward pass.
resnet(data)
tick = time.time()
resnet(data)
print("{0:.4f}".format(time.time() - tick))

For MXNet:

import time

import mxnet as mx
from mxnet import autograd, gluon, nd
from gluoncv import model_zoo

# ResNet-50 v1 replicated over two GPUs.
ctx = [mx.gpu(0), mx.gpu(1)]
resnet = model_zoo.resnet50_v1(pretrained=True, ctx=ctx)
resnet.hybridize()

data = mx.nd.ones([8, 3, 224, 224])
splitted = gluon.utils.split_and_load(data, ctx_list=ctx)

# Warm-up pass (this also triggers hybridization/graph construction).
for _data in splitted:
    resnet(_data).wait_to_read()

# Time one recorded forward pass over both GPU slices.
tick = time.time()
with autograd.record():
    for _data in splitted:
        resnet(_data)
nd.waitall()  # block until all asynchronous work has finished
print("{0:.4f}".format(time.time() - tick))

And the results are:

PyTorch: ~0.03s
MXNet: ~0.06s

I tried to reproduce the performance numbers. One problem I see in your example is that you are not iterating over multiple batches, which means your GPU is likely under-utilized. I would suggest a warm-up phase of 10 iterations followed by a main benchmark loop of 100 or more iterations, as sketched below. I ran your example with these modifications, and MXNet was then slightly faster.
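Concretely, a minimal sketch of the modified benchmark, reusing the model and data from your snippet (the iteration counts are just the ones suggested above):

import time

import mxnet as mx
from mxnet import autograd, gluon, nd
from gluoncv import model_zoo

ctx = [mx.gpu(0), mx.gpu(1)]
resnet = model_zoo.resnet50_v1(pretrained=True, ctx=ctx)
resnet.hybridize()
splitted = gluon.utils.split_and_load(mx.nd.ones([8, 3, 224, 224]), ctx_list=ctx)

# Warm-up: let hybridization, memory allocation, and autotuning settle.
for _ in range(10):
    for _data in splitted:
        resnet(_data)
nd.waitall()

# Main loop: average over many iterations for a stable number.
n_iters = 100
tick = time.time()
for _ in range(n_iters):
    with autograd.record():
        for _data in splitted:
            resnet(_data)
nd.waitall()
print("{0:.4f}s per iteration".format((time.time() - tick) / n_iters))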

Thanks a lot for sharing the link to your repository. I will have a look at the code and see if I can find out why MXImperativeInvokeEx is taking so long.

I tried to run your code from https://git.dev.tencent.com/hyesun1832/mx_hico.git, but the input dataset is missing. Where can I find it?

Yes, you can download it here:
data
Just place it in the DATA_DIR that your config file specifies.

Hey man, it seems I've found the problem. In my loss computation I need to apply weighting, and the weights are generated dynamically from the labels. I was generating the weights inside the loss function, which caused the slowdown.
I modified my pipeline so that the weight generation happens in the dataset, which lets me utilize the multiprocess worker loop (roughly as sketched below).
I don't know whether this has something to do with MXImperativeInvokeEx, but the modification does speed up my program.
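The idea, as a minimal sketch (WeightedDataset, the weighting rule, and the loader parameters are hypothetical illustrations, not my actual code):

import numpy as np
from mxnet import gluon

class WeightedDataset(gluon.data.Dataset):
    """Wraps a base dataset and precomputes a per-sample loss weight from the label."""

    def __init__(self, base):
        self._base = base

    def __len__(self):
        return len(self._base)

    def __getitem__(self, idx):
        data, label = self._base[idx]
        # Hypothetical weighting rule: computed here, inside a DataLoader
        # worker process, instead of inside the loss function.
        weight = np.where(label > 0, 2.0, 1.0).astype('float32')
        return data, label, weight

# The weights then arrive batched alongside data and labels:
# loader = gluon.data.DataLoader(WeightedDataset(train_set),
#                                batch_size=8, num_workers=4)
# for data, label, weight in loader:
#     loss = loss_fn(net(data), label) * weight  # weighting applied here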
Furthermore, I updated my CUDA (to 9.2) and cuDNN (to 7), and this also seems to help.
Thanks for your help.