Slow speed in multi-GPU data loading

I found that copying data from the CPU to multiple GPUs with split_and_load is very slow. In the experiment below, loading ~420MB of data takes 12.6 seconds. This is an EC2 P3 8Xlarge box with 4 GPUs, 32 CPUs, and 240GB of memory.
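
For reference, here is a rough back-of-the-envelope check of those numbers. It is only a sketch and assumes the array holds float32 values (4 bytes per element), which I believe is the default dtype for nd.random.uniform. The variable names are just for illustration.

```python
# Rough check of the data size and effective copy rate.
# Assumption: elements are float32, i.e. 4 bytes each.
shape = (25600, 32, 128)
bytes_per_element = 4
batch_size = 1024
elapsed_seconds = 12.6  # wall-clock time reported for the full loop

# Total number of elements in the array.
num_elements = 1
for dim in shape:
    num_elements *= dim

total_mb = num_elements * bytes_per_element / 1e6
num_batches = shape[0] // batch_size
throughput_mb_s = total_mb / elapsed_seconds
seconds_per_batch = elapsed_seconds / num_batches

print("total size        = {:.1f} MB".format(total_mb))
print("number of batches = {}".format(num_batches))
print("effective rate    = {:.1f} MB/s".format(throughput_mb_s))
print("time per batch    = {:.3f} s".format(seconds_per_batch))
```

Under that assumption the effective rate comes out to roughly 33 MB/s, or about half a second per batch of 1024 samples, which seems very low for a host-to-GPU copy.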

Is this a problem with the API or am I using the function incorrectly?

Thanks!

import time
from mxnet import nd, autograd, gluon, init, gpu, cpu
from mxnet.gluon.utils import split_and_load

batch_size = 1024
data = nd.random.uniform(shape=(25600, 32, 128))
devices = [gpu(0), gpu(1), gpu(2), gpu(3)]

step = int(batch_size/(len(devices)))
start = list(range(0, batch_size, step))
end = [s+step for s in start]

t1 = time.time()
for epoch in range(1):
    for batch in range(0, len(data), batch_size):
        data_batch_mgpu = split_and_load(data[batch:batch+batch_size], devices)
# nd.waitall()  # uncomment to block until all pending asynchronous copies finish before stopping the timer
t2 = time.time()
print("total time = {:2.1f} seconds".format(t2-t1))