Hi guys, I’m trying to figure out whether different branches of a layer can be executed in parallel. A simple example is GoogLeNet: if the 1x1 conv, 3x3 conv, etc. are deployed on different devices, can they run in parallel? Since the input data needs to be copied to the other devices, execution was always sequential in my testing (I used mx.profiler for visualization): the profile shows that the first convolution is computed, then the data is copied to the other devices and computed there. I’m curious whether there is any way to copy the data first so that the convolutions can execute at the same time.
With Gluon, you control when and where the data is sent, and where the operations (e.g. convolutions) are initialized (i.e. where the weights/biases are stored). You should be able to get parallel execution across GPUs by sending the data to both GPUs, initializing the convolutions on different GPUs, and then applying the convolutions.
```python
import mxnet as mx
from mxnet.gluon import nn

# batch size * channels * height * width
data = mx.nd.array([[[[1, 0], [0, 1]]]])

# convolution on the first GPU
ctx1 = mx.gpu(0)
conv1 = nn.Conv2D(channels=3, kernel_size=(1, 1))
conv1.initialize(ctx=ctx1)
data1 = data.as_in_context(ctx1)
out1 = conv1(data1)

# convolution on the second GPU
ctx2 = mx.gpu(1)
conv2 = nn.Conv2D(channels=3, kernel_size=(1, 1))
conv2.initialize(ctx=ctx2)
data2 = data.as_in_context(ctx2)
out2 = conv2(data2)
```
You should be aware of the transfer costs associated with moving data between GPUs though.