Color Blind SSD (VGG-16) model


#1

I am using the vgg16_atrous_voc model straight out of the zoo (pre-trained). My overall project is object tracking where occlusion is a major problem - I’m experimenting with a classic shell game, in my case 3 cups and 1 M&M: start with the M&M under cup #2, shuffle the cups, and the computer should tell you where it ends up. I’m only at the SSD phase. I am training my SSD to detect red cups, blue cups, yellow cups, M&Ms and hands (5 classes). I’m using 3 different colored cups because it will help me later with labeled data when I work on the actual shuffle videos. With that background, here is my question:

So far I have 6700 training images and 2200 validation images. The model is excellent at drawing bounding boxes around the cups - SUPER accurate. However, it seems to be color blind: its ability to distinguish red/yellow/blue is appalling. It tends to think blue cups are yellow, red cups are blue, etc. If there are 4 cups in an image, all the same color, it typically thinks they are all the same color - but the wrong one. The errors are consistent, not randomly distributed.
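
One way to make “not randomly distributed” concrete is to tally predicted vs. labelled cup colours over the validation set. A minimal sketch (the model string, checkpoint name and file paths here are illustrative, not my exact setup):

```python
# Sketch: build a 3x3 colour confusion matrix for the cup classes.
import numpy as np
from gluoncv import model_zoo, data

classes = ['red_cup', 'blue_cup', 'yellow_cup', 'mm', 'hand']
CUPS = {0, 1, 2}  # indices of the cup classes in `classes`

net = model_zoo.get_model('ssd_512_vgg16_atrous_custom',
                          classes=classes, pretrained_base=False)
net.load_parameters('ssd_cups.params')  # hypothetical checkpoint name

confusion = np.zeros((3, 3), dtype=int)  # rows = true colour, cols = predicted

def tally(image_path, true_idx):
    """Count predicted cup colours for an image containing one cup colour."""
    x, _ = data.transforms.presets.ssd.load_test(image_path, short=512)
    ids, scores, _ = net(x)
    for cid, score in zip(ids[0].asnumpy().ravel(), scores[0].asnumpy().ravel()):
        if score > 0.5 and int(cid) in CUPS:
            confusion[true_idx, int(cid)] += 1

tally('val/red_cup_001.jpg', 0)  # illustrative path
print(confusion)
```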

Things I’ve tried so far:

  • I removed _color_distort from the default SSD transformations, since it looked like it could be a major problem. Removing it didn’t really make much difference (see the sketch after this list).
  • I tried VOC07MApMetric first (each class converges to exactly 0.90909, i.e. 10/11, which looks like an artifact of the 11-point interpolated AP), then switched to VOCMApMetric (currently in use).
  • Each time I add training data it improves slightly, but the model still acts color blind.
  • I reviewed the training data visually: out of 100s of images there are no labeling mistakes - red cups are labeled as red cups.
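
For reference, the _color_distort removal doesn’t require copying the whole preset transform; patching the jitter to a no-op is enough, assuming the preset looks up random_color_distort through the experimental.image module at call time (true in the GluonCV versions I’ve looked at):

```python
# Sketch: neutralize GluonCV's colour jitter without editing the library.
from gluoncv.data.transforms import experimental

def _no_color_distort(src, **kwargs):
    # Return the image unchanged instead of jittering hue/saturation/etc.
    return src

experimental.image.random_color_distort = _no_color_distort

# The default training transform now leaves colours alone:
from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
train_transform = SSDDefaultTrainTransform(512, 512)
```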

I need some clues: why does the model do so poorly at distinguishing the cup colors? Below is a sample of its output as it trains.

INFO:root:[Epoch 203][Batch 99], Speed: 19.662 samples/sec, CrossEntropy=0.669, SmoothL1=0.187
INFO:root:[Epoch 203][Batch 199], Speed: 19.257 samples/sec, CrossEntropy=0.667, SmoothL1=0.198
INFO:root:[Epoch 203] Training cost: 429.114, CrossEntropy=0.667, SmoothL1=0.198
INFO:root:[Epoch 203] Validation:
red_cup=0.9912927244147998
blue_cup=0.9924116017115758
yellow_cup=0.9973767699777849
mm=0.9789341719642365
hand=0.9913467866058934
mAP=0.9902724109348581
INFO:root:[Epoch 204][Batch 99], Speed: 19.764 samples/sec, CrossEntropy=0.655, SmoothL1=0.186
INFO:root:[Epoch 204][Batch 199], Speed: 21.072 samples/sec, CrossEntropy=0.664, SmoothL1=0.195
INFO:root:[Epoch 204] Training cost: 422.747, CrossEntropy=0.665, SmoothL1=0.196
INFO:root:[Epoch 204] Validation:
red_cup=0.9910397455201225
blue_cup=0.992412701231508
yellow_cup=0.9978432096273536
mm=0.9747262591075023
hand=0.9937200170226307
mAP=0.9899483865018235
INFO:root:[Epoch 205][Batch 99], Speed: 21.189 samples/sec, CrossEntropy=0.652, SmoothL1=0.189
INFO:root:[Epoch 205][Batch 199], Speed: 20.269 samples/sec, CrossEntropy=0.654, SmoothL1=0.194
INFO:root:[Epoch 205] Training cost: 424.573, CrossEntropy=0.656, SmoothL1=0.196
INFO:root:[Epoch 205] Validation:
red_cup=0.9911958317438249
blue_cup=0.9924353991033601
yellow_cup=0.997228002813514
mm=0.9798318426368635
hand=0.993657171120405
mAP=0.9908696494835935
INFO:root:[Epoch 206][Batch 99], Speed: 15.724 samples/sec, CrossEntropy=0.666, SmoothL1=0.202
INFO:root:[Epoch 206][Batch 199], Speed: 21.082 samples/sec, CrossEntropy=0.661, SmoothL1=0.199
INFO:root:[Epoch 206] Training cost: 423.563, CrossEntropy=0.660, SmoothL1=0.197
INFO:root:[Epoch 206] Validation:
red_cup=0.9911916919932862
blue_cup=0.9924728659304526
yellow_cup=0.9977561538034854
mm=0.9792182166069145
hand=0.9927485332397232
mAP=0.9906774923147724


#2

Hi @duffjay,

Sounds like a fun project! My initial thought is that the pre-trained model is in fact colour blind, and that it would take a large amount of fine-tuning to make the network use colour features in its final prediction.

You’ve correctly identified colour distortion as something that could mess up fine-tuning: if it was used during pre-training, the network will have learned not to rely on colour features. When you say you’ve removed it, was that just for fine-tuning, or did you also retrain the network from scratch on VOC? You’d ideally want the latter.
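
In GluonCV terms, the two routes look roughly like this (a sketch; the custom-model string follows the zoo’s naming convention, so check the model zoo of your GluonCV version):

```python
# Sketch: fine-tuning vs. training the detection head from scratch.
from gluoncv import model_zoo

classes = ['red_cup', 'blue_cup', 'yellow_cup', 'mm', 'hand']

# Fine-tune route: reuse the VOC-pretrained detector and remap its classes.
net_ft = model_zoo.get_model('ssd_512_vgg16_atrous_voc', pretrained=True)
net_ft.reset_class(classes)

# From-scratch route: same architecture, but only the ImageNet base is
# pretrained; the detection layers start random and are free to learn
# colour-sensitive features for your classes.
net_scratch = model_zoo.get_model('ssd_512_vgg16_atrous_custom',
                                  classes=classes, pretrained_base=True)
```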

And even without the colour augmentation, the model might not use colour features when trained on the VOC dataset: I suspect the 20 VOC classes are easy to identify in greyscale images, so the network never needs colour. You can probe this directly - see the sketch below.
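
Run the pretrained model on a colour image and on a grey copy of the same image; if the detections barely change, the network is largely ignoring colour. A sketch:

```python
# Sketch: compare detections on a colour image vs. a 3-channel grey copy.
from gluoncv import model_zoo, data

net = model_zoo.get_model('ssd_512_vgg16_atrous_voc', pretrained=True)

x, img = data.transforms.presets.ssd.load_test('cups.jpg', short=512)  # illustrative path
ids_c, scores_c, boxes_c = net(x)

# Grey copy: average the (already normalized) channels and broadcast back.
x_grey = x.mean(axis=1, keepdims=True).broadcast_like(x)
ids_g, scores_g, boxes_g = net(x_grey)

print(scores_c[0][:5])  # near-identical scores => colour is mostly unused
print(scores_g[0][:5])
```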

Some other ideas: use multi-scale features for your predictions (since I’d expect earlier layers to pick up on colour more than later layers for this task), or perform a secondary task to identify colour within the bounding boxes output by the object detection network - for example, something like the sketch below.
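
For the secondary-task idea, you don’t even need a second network to start with: a dominant-hue rule inside each detected box would do as a baseline (OpenCV-based sketch; the hue thresholds are rough guesses you’d tune on your own footage):

```python
# Sketch: classify cup colour from the dominant hue inside a detected box.
import cv2
import numpy as np

def cup_color(img_bgr, box):
    """Return 'red', 'yellow' or 'blue' for the crop at box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = (int(v) for v in box)
    hsv = cv2.cvtColor(img_bgr[y1:y2, x1:x2], cv2.COLOR_BGR2HSV)
    hue = hsv[..., 0].astype(int)   # OpenCV hue range is 0..179
    hue = hue[hsv[..., 1] > 60]     # drop washed-out, low-saturation pixels
    red = np.sum((hue < 10) | (hue > 170))   # red wraps around 0/180
    yellow = np.sum((hue >= 20) & (hue < 35))
    blue = np.sum((hue >= 100) & (hue < 130))
    return ('red', 'yellow', 'blue')[int(np.argmax([red, yellow, blue]))]
```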