Evaluate accuracy on multi GPU machine


#1

Hello MxNet’ers!

I am trying to adapt the single-GPU example https://gluon.mxnet.io/chapter04_convolutional-neural-networks/cnn-gluon.html
to run on a multi-GPU machine. I am stuck on properly defining the evaluate_accuracy function. My goal is to split and load the data onto the GPUs once and then evaluate accuracy without repeatedly reloading the test dataset. Here is what I have so far:

Define the context:

ctx = [mx.gpu(i) for i in range(num_gpus)]

Load the test data onto the GPUs:

def transform(data, label):
    return nd.transpose(data.astype(np.float32), (2,0,1))/255, label.astype(np.float32)

test_data = gluon.data.DataLoader(gluon.data.vision.MNIST(train=False, transform=transform),
                                  batch_size, shuffle=False, num_workers=4)

for data, label in test_data:
    data_list = gluon.utils.split_and_load(data, ctx)
    label_list = gluon.utils.split_and_load(label, ctx)

Score with the trained net and try to get accuracy numbers from each GPU:

acc = [mx.metric.Accuracy() for i in range(num_gpus)]

for i, (data, label) in enumerate(zip(data_list, label_list)):
    data = data.as_in_context(mx.gpu(i))
    label = label.as_in_context(mx.gpu(i))
    predictions = nd.argmax(net(data), axis=1)
    acc[i].update(preds=predictions, labels=label)
    acc[i].get()[1]
print(acc[0].get()[1], acc[1].get()[1], acc[2].get()[1], acc[3].get()[1])

Output:
1.0 1.0 1.0 1.0

I don’t like that I have to calculate predictions sequentially, and I am not sure my for loop is entirely correct. I’d appreciate any insights.


#2

Hi,

Are you sure you want to do the evaluation in a multi-GPU context? Usually the evaluation dataset is smaller than the training dataset, so evaluation can take place on a single GPU. There is unnecessary copying in the evaluation script (not your fault). You can see this by looking at the definition of Accuracy within MXNet, here. There the NDArray inputs are first converted to NumPy arrays inside the Accuracy metric, so there is no point in copying the labels to a GPU context, which is expensive. (This does not apply to loss functions, only to Accuracy and, I think, the F1 and MCC metrics, but please check these.)

Try this:

acc = mx.metric.Accuracy() # Single accuracy


for i, (data,label) in enumerate(zip(data_list,label_list)):
    data = data.as_in_context(mx.gpu(0))
    label = nd.array(label) # keep this in cpu context, since this is already done inside the definition of Accuracy
    predictions = nd.argmax(net(data),axis=1).as_in_context(mx.cpu())
    acc.update(preds=predictions,labels=label)

print (acc.get()[1])
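
In case it helps to see why a single metric object is enough, and why the snippets index `get()` with `[1]`: the metric just accumulates counts on the host across `update` calls, and `get()` returns a `(name, value)` pair. Below is a minimal plain-Python sketch of that behaviour. `RunningAccuracy` is a hypothetical stand-in written for illustration, not MXNet's actual `Accuracy` class.

```python
class RunningAccuracy:
    """Simplified stand-in for an accuracy metric (not MXNet's implementation)."""

    def __init__(self):
        self.num_correct = 0
        self.num_seen = 0

    def update(self, labels, preds):
        # Counts are accumulated on the host, one batch (or shard) at a time.
        for label, pred in zip(labels, preds):
            self.num_correct += int(label == pred)
            self.num_seen += 1

    def get(self):
        # Returns a (name, value) pair, which is why callers take element [1].
        if self.num_seen == 0:
            return ("accuracy", float("nan"))
        return ("accuracy", self.num_correct / self.num_seen)


metric = RunningAccuracy()
metric.update(labels=[1, 0, 2], preds=[1, 0, 1])  # 2 of 3 correct
metric.update(labels=[3], preds=[3])              # 1 of 1 correct
name, value = metric.get()
print(name, value)  # accuracy 0.75
```

Because the counts simply add up, it does not matter which device produced a given batch of predictions.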

If you want to go parallel (please try the single-GPU version above first), try this:

GPU_COUNT = 4 
ctx = [mx.gpu(i) for i in range(GPU_COUNT)]
# net.initialize(ctx=ctx) # You need to initialize the network on multiple gpus, 
# Do the training ... 
net.load_parameters(filename, ctx=ctx) # Or read it from file and load it on multiple contexts

acc = mx.metric.Accuracy() # Single accuracy


for i, (data,label) in enumerate(zip(data_list,label_list)):
    data = gluon.utils.split_and_load(data, ctx)
    label = nd.array(label) # keep this in cpu context, since this is already done inside the definition of Accuracy

    # Perform inference on each separate GPU
    predictions = [nd.argmax(net(X),axis=1).as_in_context(mx.cpu()) for X in data]
    predictions = nd.concat(*predictions,dim=0) # Collect results
    acc.update(preds=predictions,labels=label) # update single accuracy

print (acc.get()[1])

I haven’t tested the code, but it looks OK :)

Hope this helps.


#3

Hey Feevos, thanks for the feedback!

For solution #1, I understand that accuracy won’t be calculated over the whole training or test dataset (since we split and load it across multiple GPU contexts), and the calculation runs on gpu(0). Of course, if I load both datasets onto gpu(0) this will work (although I don’t see the point of doing so on a multi-GPU system). Is my understanding correct?

That is why I came to the idea of calculating accuracy across all GPUs. I am just not sure what the most efficient way of doing this is. I will try your solution #2 now. Thanks!

Update (I commented out “data =” because the data is already loaded and accessible on each GPU):

def eval_acc_feevos(net, data_l, label_l):
    acc = mx.metric.Accuracy() # Single accuracy 
    for i, (data, label) in enumerate(zip(data_l, label_l)):
        # data = gluon.utils.split_and_load(data, ctx)
        label = nd.array(label) # keep this in cpu context, since this is already done inside the definition of Accuracy   
        # Perform inference on each separate GPU 
        pred = [nd.argmax(net(X)).as_in_context(mx.cpu()) for X in data]
        pred = nd.concat(*pred,dim=0) # Collect results
        acc.update(preds=pred,labels=label) # update single accuracy

    return (acc.get()[1])

Error:
TypeError: source_array must be array like object

For now I resorted to this ugly (and inefficient) accuracy function; it is slow but works :)

def eval_acc(net, data_l, label_l):
    acc = [mx.metric.Accuracy() for i in range(num_gpus)]
    for i, (data, label) in enumerate(zip(data_l, label_l)): # loop on 235 batches
        D=[data[n].as_in_context(mx.gpu(n)) for n in range(0,num_gpus)]
        L=[label[n].as_in_context(mx.gpu(n)) for n in range(0,num_gpus)]
        P = [nd.argmax(net(d), axis=1) for d in D]
        [a.update(preds=p, labels=l) for p, a, l in zip(P, acc, L)]
    return sum([a.get()[1] for a in acc])/num_gpus
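
One caveat about the last line of this function: averaging the per-GPU accuracies matches the accuracy over all samples only when every GPU ends up with the same number of samples. The plain-Python sketch below compares the two ways of aggregating. The helper names and the toy predictions are made up for illustration and are not taken from the notebook.

```python
def count_correct(preds, labels):
    """Number of positions where the prediction equals the label."""
    return sum(1 for p, l in zip(preds, labels) if p == l)


def mean_of_shard_accuracies(pred_shards, label_shards):
    """Average the accuracy computed separately on each shard (one per GPU)."""
    accs = [count_correct(p, l) / len(l) for p, l in zip(pred_shards, label_shards)]
    return sum(accs) / len(accs)


def pooled_accuracy(pred_shards, label_shards):
    """Accuracy over all samples, regardless of which shard they came from."""
    correct = sum(count_correct(p, l) for p, l in zip(pred_shards, label_shards))
    total = sum(len(l) for l in label_shards)
    return correct / total


# Equal shard sizes: both aggregations agree.
even_preds, even_labels = [[1, 2], [3, 0]], [[1, 2], [3, 4]]
print(mean_of_shard_accuracies(even_preds, even_labels))  # 0.75
print(pooled_accuracy(even_preds, even_labels))           # 0.75

# Unequal shard sizes: the mean of accuracies over-weights the small shard.
uneven_preds, uneven_labels = [[1, 2, 3], [0]], [[1, 2, 3], [4]]
print(mean_of_shard_accuracies(uneven_preds, uneven_labels))  # 0.5
print(pooled_accuracy(uneven_preds, uneven_labels))           # 0.75
```

Pooling the counts (or using a single metric object) avoids having to reason about shard sizes at all.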

My code is here: https://github.com/dimon777/examples/blob/master/mxnet/cnn_multigpu_mnist.ipynb

Thanks!


#4

Hi @dimon777,

I debugged the code. Both definitions of accuracy work, but you need to feed them a data generator (a gluon.data.DataLoader object in your code). In your code you were defining objects train_data_l, train_label_l, etc. for the evaluation of the accuracy, which are not necessary since you have already defined data generators (cell 4 from the top). Both of these implementations run on my laptop (a single NVIDIA Quadro M3000M).

# Runs
def eval_acc_feevos1(net, _data_generator):
    acc = mx.metric.Accuracy() # Single accuracy 
    for i, (tdata, tlabel) in enumerate(_data_generator):
        data = tdata.as_in_context(mx.gpu(0))
        label = nd.array(tlabel) # keep this in cpu context, since this is already done inside the definition of Accuracy
        pred = nd.argmax(net(data),axis=1).as_in_context(mx.cpu())
        acc.update(preds=pred,labels=label)
    return (acc.get()[1])

# Runs
def eval_acc_feevos2(net, _data_generator):
    acc = mx.metric.Accuracy() # Single accuracy 
    for i, (tdata, tlabel) in enumerate(_data_generator):
        data = gluon.utils.split_and_load(tdata, ctx)
        label = nd.array(tlabel) # keep this in cpu context, since this is already done inside the definition of Accuracy   
        # Perform inference on each separate GPU 
        pred = [nd.argmax(net(X),axis=1).as_in_context(mx.cpu()) for X in data]
        pred = nd.concat(*pred,dim=0) # Collect results
        
        acc.update(preds=pred,labels=label) # update single accuracy

    return (acc.get()[1])
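
What the second version relies on is that the per-GPU predictions, concatenated in device order, line up with the unsplit labels kept on the CPU. Here is a small framework-free sketch of that scatter/gather pattern. `split_evenly` and `toy_model` are hypothetical stand-ins (not MXNet functions) for `gluon.utils.split_and_load` and for `net` followed by `argmax`.

```python
def split_evenly(batch, num_slices):
    """Cut a batch into num_slices contiguous, equally sized slices."""
    if len(batch) % num_slices != 0:
        raise ValueError("batch size must be divisible by the number of slices")
    step = len(batch) // num_slices
    return [batch[i * step:(i + 1) * step] for i in range(num_slices)]


def toy_model(shard):
    """Stand-in for net + argmax: 'predicts' class x % 10 for each input x."""
    return [x % 10 for x in shard]


def evaluate(batches, num_devices):
    correct, total = 0, 0
    for data, labels in batches:
        shards = split_evenly(data, num_devices)            # scatter
        preds_per_device = [toy_model(s) for s in shards]   # per-device inference
        # Gather: concatenating in device order restores the original batch order.
        preds = [p for device_preds in preds_per_device for p in device_preds]
        correct += sum(1 for p, l in zip(preds, labels) if p == l)
        total += len(labels)
    return correct / total


batches = [
    ([10, 21, 32, 43], [0, 1, 2, 3]),  # all four labels match
    ([54, 65, 76, 87], [4, 5, 6, 0]),  # last label is wrong on purpose
]
result = evaluate(batches, num_devices=2)
print(result)  # 0.875
```

The result does not depend on the number of devices, as long as each batch divides evenly among them.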

I noticed a small performance gain compared with your code (this is not due to hybridize), but I guess this should prove more beneficial when you have heavy models and a lot of data (so that inference on multiple GPUs makes sense).

Code here and here.

Cheers!


#5

@feevos

Your code really rocks and saves 7 seconds compared to my implementation. So, thanks!

In eval_acc_feevos1: do I understand correctly that this will copy the whole dataset onto gpu(0) every time we call the accuracy function?

In eval_acc_feevos2: will this also copy the datasets to the GPUs on each call of the accuracy function?

If this is the case, I don’t fully get why this is faster compared to loading the datasets once (as in my naive implementation) and calculating averages against data already loaded onto the GPUs.

Cheers!


#6

Happy I could be of some help @dimon777.

Yes, but in batches of data.

It will copy a portion of the data onto each GPU (say, with 4 GPUs, it will copy 1/4 of the data onto each GPU).

In principle you should be able to optimize further if you copy the data onto the GPUs once, assuming that you can fit it (and keep it) there during training. I don’t know how this works in a for-loop environment, or with the list constructs you have.
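
A rough sketch of the copy-once idea, in plain Python so it runs without any GPU: shard every batch a single time, keep the shards around, and reuse them on every evaluation call. Everything here is hypothetical and for illustration only. `CopyCounter.to_device` merely counts calls where a real host-to-device copy would happen, and `predict` stands in for the network.

```python
class CopyCounter:
    """Stand-in for a host-to-device copy that counts how often it is called."""

    def __init__(self):
        self.copies = 0

    def to_device(self, shard, device_id):
        self.copies += 1
        return (device_id, list(shard))


def shard_batches(batches, num_devices, copier):
    """Split every batch across devices once and keep the result."""
    cached = []
    for data, labels in batches:
        if len(data) % num_devices != 0:
            raise ValueError("batch size must be divisible by the number of devices")
        step = len(data) // num_devices
        shards = [copier.to_device(data[d * step:(d + 1) * step], d)
                  for d in range(num_devices)]
        cached.append((shards, labels))  # labels stay on the host
    return cached


def evaluate_cached(cached, predict):
    """Compute accuracy from already-sharded data; performs no new copies."""
    correct = total = 0
    for shards, labels in cached:
        preds = [p for _, shard in shards for p in predict(shard)]
        correct += sum(1 for p, l in zip(preds, labels) if p == l)
        total += len(labels)
    return correct / total


batches = [([1, 2, 3, 4], [1, 2, 3, 4]), ([5, 6, 7, 8], [5, 6, 7, 0])]
copier = CopyCounter()
cached = shard_batches(batches, num_devices=2, copier=copier)

first = evaluate_cached(cached, predict=list)   # e.g. after epoch 1
second = evaluate_cached(cached, predict=list)  # e.g. after epoch 2
print(first, second, copier.copies)  # 0.875 0.875 4
```

The copy count stays at 4 (2 batches times 2 devices) no matter how often the evaluation runs. Whether this pays off in practice depends on the dataset fitting in GPU memory alongside the model.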

I am afraid I do not know this. Someone more experienced hopefully will help.

Cheers!