Question about the network traffic pattern during distributed learning


#1

Hello there,
Recently I am studying the network traffic pattern of MXNet during distributed learning. I launched a 1-server and 2-worker pseudo-cluster on the same machine with code from here. The worker is modified as follows:

with mx.Context(mx.gpu(1)): # gpu(0) for worker2.py
	# Data
	X = np.arange(1000, step=0.001)
	Y = f(X)

	# Split data for taining and evaluation
	X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

	kv_store = mx.kv.create('dist_sync')
	batch_size = 16384
	train_iter = mx.io.NDArrayIter(X_train, Y_train, batch_size, shuffle=True,label_name='lin_reg_label')
	eval_iter = mx.io.NDArrayIter(X_test, Y_test, batch_size, shuffle=False)

	X = mx.sym.Variable('data')
	Y = mx.symbol.Variable('lin_reg_label')
	fully_connected_layer = X
	for i, dim in enumerate([256, 64]):
		fully_connected_layer = mx.sym.FullyConnected(data=fully_connected_layer, num_hidden=dim, name='FC%d' % i)
		fully_connected_layer = mx.sym.Activation(data=fully_connected_layer, act_type='relu', name='ReLU%d' % i)
	fully_connected_layer = mx.sym.FullyConnected(data=fully_connected_layer, name='fcfinal', num_hidden = 1)
	#fully_connected_layer  = mx.sym.FullyConnected(data=X, name='fc1', num_hidden = 1)
	lro = mx.sym.LinearRegressionOutput(data=fully_connected_layer, label=Y, name="lro")

	model = mx.mod.Module(
	    symbol = lro ,
	    data_names=['data'],
	    label_names = ['lin_reg_label']# network structure
	)

	model.fit(train_iter, eval_iter,
		    optimizer_params={
			'learning_rate':0.0000001},
		    num_epoch=50,
		    eval_metric='mae',
		    batch_end_callback
			 = mx.callback.Speedometer(batch_size, 20),
		    kvstore=kv_store)

Scheduler logs:

[07:47:38] src/van.cc:75: Bind to role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=9 to node role=worker, ip=10.141.48.61, port=39649, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=8 to node role=server, ip=10.141.48.61, port=48947, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=11 to node role=worker, ip=10.141.48.61, port=49120, is_recovery=0

One worker logs:

INFO:root:Epoch[0] Batch [20]	Speed: 135717.08 samples/sec	mae=2499.673165
INFO:root:Epoch[0] Batch [40]	Speed: 141046.33 samples/sec	mae=2502.987744
INFO:root:Epoch[0] Train-mae=2503.911768
INFO:root:Epoch[0] Time cost=5.375
INFO:root:Epoch[0] Validation-mae=2502.035431
INFO:root:Epoch[1] Batch [20]	Speed: 154521.91 samples/sec	mae=2499.673084
INFO:root:Epoch[1] Batch [40]	Speed: 148996.23 samples/sec	mae=2502.987708
INFO:root:Epoch[1] Train-mae=2503.911768
INFO:root:Epoch[1] Time cost=4.864
INFO:root:Epoch[1] Validation-mae=2502.035324
INFO:root:Epoch[2] Batch [20]	Speed: 149160.07 samples/sec	mae=2499.672956
INFO:root:Epoch[2] Batch [40]	Speed: 158714.38 samples/sec	mae=2502.987622
INFO:root:Epoch[2] Train-mae=2503.911719
INFO:root:Epoch[2] Time cost=4.832
INFO:root:Epoch[2] Validation-mae=2502.035217
...

I recorded all network traffic during this distributed training and plotted it onto the following graph:

As I understand it, the valley represents the gradient reduction procedure of each batch on parameter server, and the red line (tcp.dstport=48947) represents push flow from worker to parameter server, which should happen at the end of each batch, while the green line(tcp.dstport == 39649 || 49120) stands for pull flow from server to worker, which is expected to start at the beginning of each batch.

But according to the graph, push/pull traffics show high correlationships and remain steady in a batch. Why does a worker keep pushing gradients before the end of a batch? Or is there anything wrong with my experiment? Any help would be greatly appreciated!

Thanks


#2

Finally I browsed through the source code and KVStore docs and got the answer: Push/Pull are executed asynchronously. And the biggest mistake in this experiment is that the execution time for one batch is too short, so network traffic in different batches get mixed together due to network scheduling algorithm. I increased the batch size and plotted the graph meeting my expectation:

In this graph, there are 4 batches for each epoch. Red line still represents gradient-pushing flow. Blue and green lines stand for parameter pulling flow from worker 1/2.