Hello there,
I have recently been studying the network traffic patterns of MXNet during distributed training. I launched a pseudo-cluster with one server and two workers on the same machine, using code from here (a sketch of the launch procedure follows the worker code below). The worker is modified as follows:
import mxnet as mx
import numpy as np
from sklearn.model_selection import train_test_split

with mx.Context(mx.gpu(1)):  # gpu(0) for worker2.py
    # Data: f is defined in the linked example
    X = np.arange(1000, step=0.001)
    Y = f(X)
    # Split data for training and evaluation
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

    kv_store = mx.kv.create('dist_sync')

    batch_size = 16384
    train_iter = mx.io.NDArrayIter(X_train, Y_train, batch_size,
                                   shuffle=True, label_name='lin_reg_label')
    eval_iter = mx.io.NDArrayIter(X_test, Y_test, batch_size, shuffle=False)

    # Network structure: two ReLU-activated fully connected layers, then a scalar output
    X = mx.sym.Variable('data')
    Y = mx.sym.Variable('lin_reg_label')
    fully_connected_layer = X
    for i, dim in enumerate([256, 64]):
        fully_connected_layer = mx.sym.FullyConnected(
            data=fully_connected_layer, num_hidden=dim, name='FC%d' % i)
        fully_connected_layer = mx.sym.Activation(
            data=fully_connected_layer, act_type='relu', name='ReLU%d' % i)
    fully_connected_layer = mx.sym.FullyConnected(
        data=fully_connected_layer, name='fcfinal', num_hidden=1)
    # fully_connected_layer = mx.sym.FullyConnected(data=X, name='fc1', num_hidden=1)
    lro = mx.sym.LinearRegressionOutput(data=fully_connected_layer, label=Y, name="lro")

    model = mx.mod.Module(
        symbol=lro,
        data_names=['data'],
        label_names=['lin_reg_label'],
    )
    model.fit(train_iter, eval_iter,
              optimizer_params={'learning_rate': 0.0000001},
              num_epoch=50,
              eval_metric='mae',
              batch_end_callback=mx.callback.Speedometer(batch_size, 20),
              kvstore=kv_store)
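For completeness, the cluster processes can be launched by setting the DMLC/PS-Lite environment variables for each role; with DMLC_ROLE set to scheduler or server, simply importing mxnet runs that role. A minimal sketch of such a launcher (the worker file names are from this post; everything else is illustrative):

import os
import subprocess

# Shared PS-Lite settings: scheduler address/port must match across all roles.
common = {
    'DMLC_PS_ROOT_URI': '127.0.0.1',
    'DMLC_PS_ROOT_PORT': '9000',
    'DMLC_NUM_SERVER': '1',
    'DMLC_NUM_WORKER': '2',
}

roles = [
    ('scheduler', ['python', '-c', 'import mxnet']),  # blocks, runs the scheduler
    ('server',    ['python', '-c', 'import mxnet']),  # blocks, runs the server loop
    ('worker',    ['python', 'worker.py']),           # uses gpu(1)
    ('worker',    ['python', 'worker2.py']),          # uses gpu(0)
]

procs = [subprocess.Popen(cmd, env=dict(os.environ, DMLC_ROLE=role, **common))
         for role, cmd in roles]
for p in procs:
    p.wait()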
Scheduler logs:
[07:47:38] src/van.cc:75: Bind to role=scheduler, id=1, ip=127.0.0.1, port=9000, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=9 to node role=worker, ip=10.141.48.61, port=39649, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=8 to node role=server, ip=10.141.48.61, port=48947, is_recovery=0
[07:47:38] src/van.cc:236: assign rank=11 to node role=worker, ip=10.141.48.61, port=49120, is_recovery=0
Logs from one worker:
INFO:root:Epoch[0] Batch [20] Speed: 135717.08 samples/sec mae=2499.673165
INFO:root:Epoch[0] Batch [40] Speed: 141046.33 samples/sec mae=2502.987744
INFO:root:Epoch[0] Train-mae=2503.911768
INFO:root:Epoch[0] Time cost=5.375
INFO:root:Epoch[0] Validation-mae=2502.035431
INFO:root:Epoch[1] Batch [20] Speed: 154521.91 samples/sec mae=2499.673084
INFO:root:Epoch[1] Batch [40] Speed: 148996.23 samples/sec mae=2502.987708
INFO:root:Epoch[1] Train-mae=2503.911768
INFO:root:Epoch[1] Time cost=4.864
INFO:root:Epoch[1] Validation-mae=2502.035324
INFO:root:Epoch[2] Batch [20] Speed: 149160.07 samples/sec mae=2499.672956
INFO:root:Epoch[2] Batch [40] Speed: 158714.38 samples/sec mae=2502.987622
INFO:root:Epoch[2] Train-mae=2503.911719
INFO:root:Epoch[2] Time cost=4.832
INFO:root:Epoch[2] Validation-mae=2502.035217
...
I recorded all network traffic during this distributed training and plotted it in the following graph:
As I understand it, each valley represents the gradient-reduction procedure for a batch on the parameter server. The red line (tcp.dstport == 48947) represents the push flow from the workers to the parameter server, which should happen at the end of each batch, while the green line (tcp.dstport == 39649 || tcp.dstport == 49120) represents the pull flow from the server to the workers, which is expected to start at the beginning of each batch.
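My mental model here comes from the basic kvstore API, where each batch would pull weights at the start and push gradients at the end. A minimal single-key sketch (using 'local' so it runs standalone; the cluster uses 'dist_sync'):

import mxnet as mx

kv = mx.kv.create('local')      # 'dist_sync' in the cluster
shape = (2, 3)
kv.init(0, mx.nd.ones(shape))   # server-side weight for key 0

weight = mx.nd.zeros(shape)
kv.pull(0, out=weight)          # expected at the START of a batch: fetch weights
grad = mx.nd.ones(shape)        # stand-in for this batch's gradient
kv.push(0, grad)                # expected at the END of a batch: send the gradient
                                # (in dist_sync the server applies the optimizer update)
kv.pull(0, out=weight)          # next batch pulls the updated weights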
But according to the graph, the push and pull traffic are highly correlated and remain steady throughout a batch. Why would a worker keep pushing gradients before the end of a batch? Or is there something wrong with my experiment? Any help would be greatly appreciated!
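In case it is useful for checking the experiment, the per-port byte counts behind the graph can be reproduced from a capture file along these lines (a sketch assuming scapy; 'training.pcap' is a placeholder name, the ports are taken from the scheduler log above):

from collections import defaultdict
from scapy.all import rdpcap, TCP

SERVER_PORT = 48947            # push: worker -> server
WORKER_PORTS = {39649, 49120}  # pull: server -> worker

# Aggregate bytes per second, split by destination port.
push_bytes = defaultdict(int)
pull_bytes = defaultdict(int)
for pkt in rdpcap('training.pcap'):
    if TCP not in pkt:
        continue
    second = int(pkt.time)
    if pkt[TCP].dport == SERVER_PORT:
        push_bytes[second] += len(pkt)
    elif pkt[TCP].dport in WORKER_PORTS:
        pull_bytes[second] += len(pkt)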
Thanks