Default YOLOv3 does not improve

Hi,

I am trying to run the train_yolo3.py example script with every setting left at its default except the batch size. All the losses are NaN during training, and after a whole epoch the reported mAP is 0.0. Training does seem to be running, since there is load on my GPU, but the model won’t improve.

This happens on the default VOC dataset and also on a custom dataset. I have not tested with COCO.

python3 train_yolo3.py --network darknet53 --dataset voc --gpus 0,1,2,3,4,5,6,7 --batch-size 64 -j 16 --log-interval 100 --lr-decay-epoch 160,180 --epochs 200 --syncbn --warmup-epochs 4

Here are the parameters used for the model zoo.

Note that the batch size can have a big influence on the final results, since many hyperparameters (the learning rate especially) are tied to it. Try lowering the default learning rate and see if it helps.
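As a rough illustration of how the learning rate is tied to the batch size, one common heuristic is the linear scaling rule: keep the ratio of learning rate to batch size constant. This is only a sketch of that heuristic, not something the training script is known to do for you. The helper name `scaled_lr` and the numbers below are hypothetical, so substitute the reference batch size and learning rate of the configuration you are starting from.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: keep lr / batch_size constant.

    base_lr and base_batch_size describe a reference configuration that is
    known to train; batch_size is the one you can actually fit in memory.
    """
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Hypothetical numbers: a reference run with batch size 64 and lr 1e-3,
# rescaled for a batch size of 8.
new_lr = scaled_lr(1e-3, 64, 8)
print(f"suggested starting learning rate: {new_lr:.6f}")
```

The result is only a starting point to pass as the initial learning rate. If the losses still go to NaN, lowering it further is a reasonable next step.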

Thank you. I have trained SSD and Faster R-CNN with reduced batch sizes with success. I would never have imagined that the batch size would cause this problem. I reduced the learning rate and it now trains successfully.

I tried reducing the learning rate and changing other parameters. I tried many different values for the initial learning rate, but YOLO always ends up with a considerably lower mAP than SSD and Faster R-CNN.
Another problem I spotted is that YOLO makes larger localization errors and also assigns very low confidence to its predictions. With a threshold of 0.5, most objects are missed.