Classifying Images into 11K classes with a pretrained model

I am new to this forum and to mxnet and to R as well, so I apologize if my question is rudimentary. I am interested in classifying images using mxnet in the R language and I had great success following the instructions from this post about how to classify images with the “inception” pretrained model: https://mxnet.incubator.apache.org/tutorials/r/classifyRealImageWithPretrainedModel.html
My problem is that most of the images I have do not fall well into any of the 1000 “synsets” that are part of this model. So I want to run the same program but with a larger set of possible classes, and I have found one with 11K classes, or “synsets”, here: http://data.mxnet.io/models/imagenet-11k/
I tried to use the exact same code from the 1K example, plugging in this synset list along with the resnet-50 pretrained model.
The problem is that the results were not good. For example, a picture that was obviously an airplane, and that came up with “airplane” as the top class using the 1K set, only ranked around 3000th with the 11K set, with totally unrelated classes like “clown” at the top. In trying to figure out what went wrong, I am wondering if there might be a problem with the “mean image”. I used the same “mean image”, which is to say “Inception/mean_224.nd”, for the 11K model because I could not find a mean image made specifically for that model. Could this be ruining all the results?

I literally have no idea what a mean image even is, or how this system works at all. I am simply cutting and pasting code, as this kind of deep learning is far beyond my understanding. But again, it worked beautifully with the 1K Inception data, and I would love to get it working with a larger set of possible classes for my images.

Welcome to the forum @dveidlinger! Glad you’ve managed to get up and running with the ImageNet 1k Inception model.

So the ‘mean image’ is the average of all the images used for training. Say you have a red, green, blue (RGB) image of size 299×299. It has 89,401 pixels, each with 3 values (one per colour channel). You obtain a ‘pixel mean’ by adding up the values of that specific pixel (and colour channel) across the whole training dataset and then dividing by the number of images. It’s pixel- and channel-wise: e.g. the mean top-left pixel is the average of all the top-left pixels in the dataset. A ‘mean image’ is just this operation done for every one of the 299×299 pixels.
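To make that concrete, here’s a minimal sketch in R of what the calculation does, using a made-up random array in place of a real training set:

```r
# Hypothetical stand-in for a training set: 10 RGB images of size 299 x 299,
# stored as a 299 x 299 x 3 x 10 array with raw pixel values in [0, 255].
imgs <- array(runif(299 * 299 * 3 * 10, 0, 255), dim = c(299, 299, 3, 10))

# Average over the last (image) dimension only: the result is itself a
# 299 x 299 x 3 "image" whose top-left red value is the mean of every
# top-left red value across the dataset.
mean_image <- apply(imgs, c(1, 2, 3), mean)
dim(mean_image)  # 299 299 3
```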

Why is this necessary? Deep learning models usually need normalized input to train correctly: the pixel values should have a mean of roughly 0 and a standard deviation of 1. So once the mean image has been calculated, it needs to be subtracted from each image that’s given to the network, both during training and when using the model for predictions. And yes, this could well be your problem: if the 11K resnet-50 model was trained with different preprocessing, subtracting Inception’s mean image shifts every input away from what the network saw during training, which can easily scramble the predictions.
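In the tutorial you followed, this subtraction happens inside the `preproc.image()` helper. Here’s a minimal sketch of just the normalization step, assuming the image has already been cropped and resized to a 224×224×3 array of raw pixel values:

```r
library(mxnet)

# Load the mean image shipped with the 1K Inception model (this is the file
# the tutorial uses; the 11K resnet-50 model may expect different
# preprocessing, which is exactly the issue discussed above).
mean_img <- as.array(mx.nd.load("Inception/mean_224.nd")[["mean_img"]])

# 'arr' is assumed to be a 224 x 224 x 3 array of raw pixel values (0-255).
normalize <- function(arr, mean_img) {
  normed <- arr - mean_img          # pixel- and channel-wise subtraction
  dim(normed) <- c(224, 224, 3, 1)  # add the batch dimension mxnet expects
  normed
}
```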

As for predicting classes that aren’t included in the 1000 ImageNet classes, you should look at transfer learning (also called fine-tuning). It reuses the features the network has learnt on ImageNet and applies them to your new classes, so you don’t need much data per class to get started, and you have full control over the classes you want the model to predict. You can find a tutorial on transfer learning at http://gluon.mxnet.io/chapter08_computer-vision/fine-tuning.html# and a rough sketch of the idea below.
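If you’d rather stay in R, the core idea can be sketched with the mxnet R API too. Treat this as a sketch under assumptions, not a tested recipe: I’m assuming the downloaded model files are prefixed `resnet-50`, that the layer feeding the 11K classifier is named `flatten0` (check the symbol’s internals to be sure), and `num_classes` is hypothetical:

```r
library(mxnet)

# Load the pretrained 11K resnet-50 (prefix and iteration as downloaded).
model <- mx.model.load("resnet-50", iteration = 0)

# Find the layer just before the 11K-way classifier. "flatten0" is an
# assumption about the layer name -- inspect internals$outputs to confirm.
internals <- model$symbol$get.internals()
flatten <- internals$get.output(which(internals$outputs == "flatten0_output"))

# Bolt a new, smaller head onto the reused feature extractor.
num_classes <- 5  # hypothetical: the number of classes in your own data
new_fc <- mx.symbol.FullyConnected(data = flatten, num_hidden = num_classes,
                                   name = "fc_new")
new_sym <- mx.symbol.SoftmaxOutput(data = new_fc, name = "softmax")

# Train new_sym with mx.model.FeedForward.create(), passing the reused
# weights via arg.params = model$arg.params so only the new head starts
# from scratch.
```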

Given you’re new to both MXNet and R, I’d also recommend looking at the Python API, as it gives you Gluon, which is easy to work with and great for debugging.

Thanks a lot. Those are very helpful ideas. I will get on it.