Today I started using MXNet's GluonCV ImageNet training script, with the MobileNet1.0 bash config presented here (classification.html). A single epoch takes more than 2 hours to complete (2 hours and 35 minutes, to be exact!), while in PyTorch, for example, an epoch took around 45 minutes using my GTX 1080.
I have a 4790K @ 4.5GHz, and a Samsung 840 EVO 250GB from which I'm reading my training data. I have both CUDA 9.0 and cuDNN 7.4 installed and ready.
GPU load is constantly at 99~100%.
The GPU fans are at 45% speed!
8.6G/15.6G of system RAM is used.
And I’m on Ubuntu 16.04.5.
MXNet version: 1.3.1
GluonCV version: 0.4.0
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48 Driver Version: 390.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 46% 63C P2 90W / 200W | 7380MiB / 8116MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1091 G /usr/lib/xorg/Xorg 172MiB |
| 0 2867 G compiz 184MiB |
| 0 15164 C python 7017MiB |
+-----------------------------------------------------------------------------+
Although I can see 20 threads in htop, only one thread at a time consumes nearly 100% of CPU time; the other threads consume less than 16% (and it gets lower for each remaining thread)!
By the way, this is how I initiated the training:
python train_imagenet.py \
--rec-train /media/ssd/rec/train.rec --rec-train-idx /media/ssd/rec/train.idx \
--rec-val /media/ssd/rec/val.rec --rec-val-idx /media/ssd/rec/val.idx \
--model simpnet1.0 --mode hybrid \
--lr 0.4 --lr-mode cosine --num-epochs 200 --batch-size 256 --num-gpus 1 -j 20 \
--use-rec --dtype float16 --warmup-epochs 5 --no-wd --label-smoothing --mixup \
--save-dir params_mobilenet1.0_mixup \
--logging-file simpnet1.0_mixup.log
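Since only one CPU thread is ever busy, my first suspicion is the data pipeline rather than the GPU. Here is a rough sketch of how I could time the record iterator alone (the paths are the ones from my command above; the 256 batch size and 224x224 shape are my assumptions matching the training defaults) to see how many images per second the loader can actually produce:

import time
import mxnet as mx

# Time the record iterator alone, with no network attached, to see
# whether data decoding (rather than GPU compute) is the bottleneck.
train_data = mx.io.ImageRecordIter(
    path_imgrec='/media/ssd/rec/train.rec',
    path_imgidx='/media/ssd/rec/train.idx',
    preprocess_threads=20,   # same as -j 20
    shuffle=True,
    batch_size=256,
    data_shape=(3, 224, 224),
)

tic = time.time()
n = 0
for batch in train_data:
    n += batch.data[0].shape[0]
    if n >= 50 * 256:        # ~50 batches is enough for a stable rate
        break
print('%.1f images/sec (data pipeline only)' % (n / (time.time() - tic)))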
Update:
At first I thought maybe this was because of using float16 as the dtype, so I also tried float32, but the problem persisted; that is, the training performance did not change at all!
Also, the way the GPU is being utilized is very weird. While it reports 100% under load, the temperature never goes beyond 62~64C. This is strange because the fans are also nearly at idle speed; usually a 100% load results in a temperature of 70~72C.
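From what I understand, the GPU-Util figure in nvidia-smi only says that some kernel was executing during the sample window, not how hard the chip was working, so an underfed GPU running many small kernels can still show 100% while staying cool. To take the input pipeline out of the picture entirely, here is a minimal sketch of timing forward/backward on random data (the model name, batch size, and input shape are my assumptions; the loss and optimizer are just placeholders for the measurement):

import time
import mxnet as mx
from gluoncv.model_zoo import get_model

ctx = mx.gpu(0)

# Build the network and feed it random data so only GPU compute is
# measured; if this is fast, the GPU is fine and is being starved.
net = get_model('mobilenet1.0', pretrained=False)
net.initialize(ctx=ctx)
net.hybridize()

x = mx.nd.random.uniform(shape=(256, 3, 224, 224), ctx=ctx)
label = mx.nd.zeros((256,), ctx=ctx)
loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss()
trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})

mx.nd.waitall()                 # make sure initialization has finished
tic = time.time()
for _ in range(20):
    with mx.autograd.record():
        loss = loss_fn(net(x), label)
    loss.backward()
    trainer.step(256)
mx.nd.waitall()                 # wait for all async GPU work to finish
print('%.1f images/sec (GPU compute only)' % (20 * 256 / (time.time() - tic)))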
The CPU utilization is weird as well: it doesn't matter if I use 4 threads or 20 threads, the CPU utilization stays almost the same.
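In case the worker count isn't taking effect at all, one thing I can still rule out is an environment variable capping CPU-side parallelism. A minimal sketch (these are standard MXNet/OpenMP knobs, not anything from the training script, and the values are assumptions for my 4-core/8-thread 4790K):

import os
# These knobs can silently cap CPU-side parallelism; they must be
# set before mxnet is imported to have any effect.
os.environ.setdefault('OMP_NUM_THREADS', '8')
os.environ.setdefault('MXNET_CPU_WORKER_NTHREADS', '8')
import mxnet as mx
print(mx.__version__)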
When training in PyTorch, I'd use 20 workers, and all 8 hardware threads were utilized nearly to the max! GPU utilization was between 89~99%, the temperature was around 72~74C, and each epoch took around 45 minutes to complete, definitely not nearly 3.44x longer as in MXNet.
I guess there's a bug somewhere here; this doesn't make sense to me at all.
Any help is greatly appreciated.
Thanks in advance