Some samples were skipped in the training loop

ssbusc1 · December 20, 2018, 9:49pm

While training a model with the Module API, i.e. module.fit() with an NDArrayIter, I noticed that a large share of the samples (close to half, judging by the logs below) were skipped over in the first epoch. In the subsequent epochs, it looks like all the samples were used.

This is from a multi-label classification problem with about 3.5 million samples, encoded into sparse matrices. Training was performed on a GPU, using a batch size of 128.

From the Speedometer logs during training:

06:29:01 WARNING:Already bound, ignoring bind()
06:29:01 WARNING:optimizer already initialized, ignoring...
06:29:29 INFO:Epoch[0] Batch [5000]	Speed: 23131.93 samples/sec	loss=0.000037
06:29:56 INFO:Epoch[0] Batch [10000]	Speed: 23457.39 samples/sec	loss=0.000037
06:30:23 INFO:Epoch[0] Train-loss=0.000068
06:30:23 INFO:Epoch[0] Time cost=81.726
06:30:51 INFO:Epoch[1] Batch [5000]	Speed: 23192.88 samples/sec	loss=0.000110
06:31:18 INFO:Epoch[1] Batch [10000]	Speed: 23508.18 samples/sec	loss=0.000113
06:31:45 INFO:Epoch[1] Batch [15000]	Speed: 23572.26 samples/sec	loss=0.000113
06:32:12 INFO:Epoch[1] Batch [20000]	Speed: 23616.47 samples/sec	loss=0.000113
06:32:40 INFO:Epoch[1] Batch [25000]	Speed: 23192.46 samples/sec	loss=0.000112
06:32:48 INFO:Epoch[1] Train-loss=0.000111
06:32:48 INFO:Epoch[1] Time cost=145.542
06:33:16 INFO:Epoch[2] Batch [5000]	Speed: 23121.89 samples/sec	loss=0.000110
06:33:44 INFO:Epoch[2] Batch [10000]	Speed: 23425.52 samples/sec	loss=0.000109
06:34:11 INFO:Epoch[2] Batch [15000]	Speed: 23443.41 samples/sec	loss=0.000108
06:34:38 INFO:Epoch[2] Batch [20000]	Speed: 23487.69 samples/sec	loss=0.000107
06:35:05 INFO:Epoch[2] Batch [25000]	Speed: 23414.13 samples/sec	loss=0.000106
06:35:14 INFO:Epoch[2] Train-loss=0.000106
06:35:14 INFO:Epoch[2] Time cost=145.740
06:35:42 INFO:Epoch[3] Batch [5000]	Speed: 23379.98 samples/sec	loss=0.000105
06:36:09 INFO:Epoch[3] Batch [10000]	Speed: 23363.26 samples/sec	loss=0.000104
06:36:36 INFO:Epoch[3] Batch [15000]	Speed: 23450.45 samples/sec	loss=0.000103
06:37:04 INFO:Epoch[3] Batch [20000]	Speed: 23494.26 samples/sec	loss=0.000102
06:37:31 INFO:Epoch[3] Batch [25000]	Speed: 23388.84 samples/sec	loss=0.000102
06:37:39 INFO:Epoch[3] Train-loss=0.000102
06:37:39 INFO:Epoch[3] Time cost=145.431
   ...

Note that the first epoch also took significantly less time (81s vs. 145s for each of the other epochs), so this does not seem to be an issue with the Speedometer logging.
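As a rough sanity check on the timing argument, the epoch durations can be turned into an estimated batch count. This is only a back-of-the-envelope sketch: the sample count is the approximate "about 3.5 million" figure from above, and it assumes throughput stayed constant (the Speedometer lines show roughly 23,000 samples/sec throughout).

```python
# Figures copied from the log above. The sample count is only approximate,
# so every result here is an estimate, not a measurement.
num_samples = 3_500_000
batch_size = 128

epoch0_seconds = 81.726    # "Epoch[0] Time cost"
epoch1_seconds = 145.542   # "Epoch[1] Time cost", taken as a full pass

# With last_batch_handle='discard', a full pass yields floor(N / batch_size) batches.
batches_per_full_epoch = num_samples // batch_size

# At steady throughput, elapsed time is a proxy for the share of data processed.
fraction_seen = epoch0_seconds / epoch1_seconds
estimated_epoch0_batches = round(fraction_seen * batches_per_full_epoch)

print("batches in a full epoch:    ", batches_per_full_epoch)
print("share of epoch 0 processed: ", f"{fraction_seen:.2f}")
print("estimated batches, epoch 0: ", estimated_epoch0_batches)
```

Under these assumptions, the first epoch covered a bit over half of a full pass, which is in line with its last Speedometer line being Batch [10000] while the later epochs reach Batch [25000].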

The training loop is fairly uncomplicated:

        train_iter = mx.io.NDArrayIter(train_data, train_labels,
                                       batch_size=batch_size,
                                       last_batch_handle='discard',
                                       data_name='X', label_name='Y')

        speedometer = mx.callback.Speedometer(batch_size, speedometer_frequency)

        module.fit(train_iter,
                   eval_metric='loss',
                   batch_end_callback=speedometer,
                   num_epoch=num_epoch,
                   kvstore=kvstore.create('device'))

Any suggestions on tracking this down?

NRauschmayr · December 21, 2018, 8:36am

Can you try resetting the iterator in every epoch by calling train_iter.reset()? This will hopefully fix the problem.

ssbusc1 · December 21, 2018, 5:32pm

D’oh. That may just be it. I have been running this in an interactive session, and likely missed resetting the iterator.
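For anyone landing here later, the failure mode can be reproduced without MXNet. The sketch below is a toy illustration only: ToyBatchIter and toy_fit are hypothetical stand-ins, not MXNet code, and they assume a cursor-based iterator plus a training loop that resets the iterator only at the end of each epoch, which would be consistent with the log above. In the real session, the equivalent fix would be calling train_iter.reset() before module.fit(...).

```python
class ToyBatchIter:
    """Hypothetical stand-in for a cursor-based batch iterator (not MXNet code)."""

    def __init__(self, num_samples, batch_size):
        # Integer division mimics last_batch_handle='discard'.
        self.num_batches = num_samples // batch_size
        self.cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.cursor >= self.num_batches:
            raise StopIteration
        self.cursor += 1
        return self.cursor - 1  # index of the batch just served

    def reset(self):
        self.cursor = 0


def toy_fit(data_iter, num_epoch):
    """Count batches per epoch, resetting only at the END of each epoch."""
    counts = []
    for _ in range(num_epoch):
        counts.append(sum(1 for _ in data_iter))
        data_iter.reset()
    return counts


it = ToyBatchIter(num_samples=1000, batch_size=10)

# Earlier interactive use consumes 40 batches and never resets the iterator.
for _ in range(40):
    next(it)
stale_counts = toy_fit(it, num_epoch=3)   # first epoch comes up short

# Same situation again, but with an explicit reset before training.
for _ in range(40):
    next(it)
it.reset()
fresh_counts = toy_fit(it, num_epoch=3)   # every epoch sees all batches

print("without reset:", stale_counts)
print("with reset:   ", fresh_counts)
```

Only the first epoch is affected in the stale case, because the end-of-epoch reset puts the cursor back at the start for every later epoch.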
