Hi, in the official example code of YOLOv3, the loss terms seem to be calculated over tensors of shape
(B, all_feature_map_locations * 3, -), which means all the possible anchor boxes are used in the training. However, out of
all_feature_map_locations * 3 anchor boxes, only N have a ground-truth box associated with them, where N is the number of ground-truth boxes in the image, which is extremely small (usually 3 or 4, and no more than 10). So it seems to me that the negative examples vastly outnumber the positives in each batch, which seems problematic.
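To put a rough number on it, here is a quick count. This is only a sketch under my own assumptions, not taken from the example code: a 416x416 input, three detection scales with strides 8, 16 and 32, 3 anchors per location, and a made-up N of 5 ground-truth boxes. The function name and parameters are just for illustration.

```python
def count_anchor_boxes(input_size=416, strides=(8, 16, 32), anchors_per_cell=3):
    """Count predicted boxes across all detection scales (assumed configuration)."""
    # Each scale has a square grid of (input_size // stride) cells per side.
    locations = sum((input_size // s) ** 2 for s in strides)
    return locations * anchors_per_cell


total = count_anchor_boxes()   # all_feature_map_locations * 3
num_gt = 5                     # hypothetical N for a single image
positive_fraction = num_gt / total

print(f"anchor boxes per image: {total}")
print(f"fraction with a ground-truth match: {positive_fraction:.4%}")
```

Under these assumptions there are 10647 anchor boxes per image, so fewer than 0.1% of them would be positives.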
Can anyone help explain the implementation of the official YOLOv3 code? Is the balance between positive and negative samples really not taken care of here?