Multi-label classification with SigmoidBCELoss not working

SigmoidBCELoss is typically used in multi-label classification (where a single example can belong to multiple classes). When I train a model this way using Gluon, the network doesn’t appear to learn, as can be seen from both the training loss and the training accuracy (e.g., the training accuracy is stuck at 53% from the first epoch). An identical model trained in Keras/TensorFlow shows gradually increasing training accuracy (and decreasing training loss), eventually reaching 100% training accuracy in about 200 epochs. What am I doing wrong here?

Here are both the Gluon and Keras examples to reproduce the problem I’m experiencing:

# MXNet Gluon: doesn't appear to work
import mxnet as mx
from mxnet import gluon, nd
from mxnet.gluon import nn
import numpy as np

X = [[1,0,0,0,0,0,0],
      [1,2,0,0,0,0,0],
      [3,0,0,0,0,0,0],
      [3,4,0,0,0,0,0],
      [2,0,0,0,0,0,0],
      [3,0,0,0,0,0,0],
      [4,0,0,0,0,0,0],
      [2,3,0,0,0,0,0],
      [1,2,3,0,0,0,0],
      [1,2,3,4,0,0,0],
      [0,0,0,0,0,0,0],
      [1,1,2,3,0,0,0],
      [2,3,3,4,0,0,0],
      [4,4,1,1,2,0,0],
      [1,2,3,3,3,3,3],
      [2,4,2,4,2,0,0],
      [1,3,3,3,0,0,0],
      [4,4,0,0,0,0,0],
      [3,3,0,0,0,0,0],
      [1,1,4,0,0,0,0]]
                                                                    
Y = [[1,0,0,0],                                                     
    [1,1,0,0],                                                      
    [0,0,1,0],                                                      
    [0,0,1,1],
    [0,1,0,0],
    [0,0,1,0],
    [0,0,0,1],
    [0,1,1,0],
    [1,1,1,0],
    [1,1,1,1],
    [0,0,0,0],
    [1,1,1,0],
    [0,1,1,1],
    [1,1,0,1],
    [1,1,1,0],
    [0,1,0,0],
    [1,0,1,0],
    [0,0,0,1],
    [0,0,1,0],
    [1,0,0,1]]
loader = gluon.data.DataLoader(gluon.data.SimpleDataset(list(zip(X,Y))), batch_size=1)

MAXLEN = 7
MAXFEATURES = 4
ctx = mx.gpu(0)
NUM_CLASSES = 4

# model
net = gluon.nn.HybridSequential()
with net.name_scope():
    net.add(nn.Embedding(MAXFEATURES+1, 50))
    net.add(nn.GlobalAvgPool1D())
    net.add(nn.Dense(NUM_CLASSES, activation='sigmoid'))
net.hybridize()
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx)

# trainer
trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='adam',
    optimizer_params={'learning_rate': 0.001},
)
metric = mx.metric.Accuracy()
loss_function = gluon.loss.SigmoidBCELoss(from_sigmoid=True)


for epoch in range(200):
    for (data, labels) in loader:
        data = data.as_in_context(ctx)
        labels = labels.astype('float32').as_in_context(ctx)
        with mx.autograd.record():
            outputs = net(data)
            loss = loss_function(outputs, labels)
        loss.backward()
        metric.update(labels, outputs)
        trainer.step(batch_size=data.shape[0])
    name, acc = metric.get()
    if epoch % 20 == 0:
        print("epoch %s: %s" % (epoch, acc))
    metric.reset()

The training accuracy of this model is stuck at 53% from the first epoch, and predictions on training examples after training are completely off (so it’s not just an issue with the EvalMetric).
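As a sanity check on the 53% figure, here is a minimal NumPy sketch of element-wise multi-label accuracy. It assumes predictions are binarized at a 0.5 threshold; the helper name `multilabel_accuracy` and the threshold are illustrative and not part of the code above. It also scores an all-zero prediction, which is what sigmoid outputs in (0, 1) would become if a metric cast them to integers without thresholding. Whether `mx.metric.Accuracy` does this in the installed version is an assumption worth verifying.

```python
import numpy as np

# Labels copied from the Y matrix above (20 examples, 4 classes).
Y = np.array([
    [1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 1],
    [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 1, 0],
    [1, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 0],
    [0, 1, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0], [0, 1, 0, 0],
    [1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [1, 0, 0, 1],
])


def multilabel_accuracy(probs, labels, threshold=0.5):
    """Fraction of (example, class) entries predicted correctly.

    Hypothetical helper: binarizes probabilities at `threshold`
    before comparing them with the 0/1 labels.
    """
    preds = (np.asarray(probs) >= threshold).astype(np.int32)
    return float((preds == np.asarray(labels)).mean())


# Baseline: predict "no label" for every class of every example.
baseline = multilabel_accuracy(np.zeros(Y.shape), Y)
print("all-zero baseline accuracy: %.4f" % baseline)  # 0.5375
```

On this dataset the all-zero baseline is 43/80 = 53.75%, which matches the stuck training accuracy, so the reported number may say more about how the metric is fed than about the model. It would not, by itself, explain the poor predictions.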

The same model implemented in Keras/TensorFlow, as shown below, trains as expected.

# Keras: works and achieves 100% training accuracy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import GlobalAveragePooling1D
import numpy as np

X = np.array(X)
Y = np.array(Y)
MAXLEN = 7
MAXFEATURES = 4
NUM_CLASSES = 4
model = Sequential()
model.add(Embedding(MAXFEATURES+1,
                    50,
                    input_length=MAXLEN))
model.add(GlobalAveragePooling1D())
model.add(Dense(NUM_CLASSES, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
model.fit(X, Y,
          batch_size=1,
          epochs=200,
          validation_data=(X, Y))

The Keras/TensorFlow predictions on training examples are near-perfect, but the Gluon predictions are way off.

The Gluon prediction for a training example is not accurate (the first and fourth positions in the output array should be near 1 and the others near 0).

The Keras/TensorFlow model outputs the following for the same input, which is correct (the first and fourth positions in the output array are near 1 and above 0.5, with the remaining classes near 0):

array([[0.8340912 , 0.01359105, 0.01566718, 0.94641566]], dtype=float32)
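To make "correct" concrete, the sketch below thresholds the Keras output at 0.5 and computes its binary cross-entropy against the target `[1, 0, 0, 1]` implied by the description above. The target vector, the 0.5 threshold, and the helper name `binary_cross_entropy` are assumptions for illustration; the formula is the standard element-wise binary cross-entropy averaged over classes.

```python
import numpy as np


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over classes for probabilities in (0, 1).

    Hypothetical helper; `eps` clips probabilities to avoid log(0).
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    per_class = -(targets * np.log(probs)
                  + (1.0 - targets) * np.log(1.0 - probs))
    return float(per_class.mean())


# Keras output quoted above, and the target assumed from the description.
keras_output = np.array([0.8340912, 0.01359105, 0.01566718, 0.94641566])
target = np.array([1.0, 0.0, 0.0, 1.0])

# Binarize at the assumed 0.5 threshold to get predicted labels.
predicted_labels = (keras_output >= 0.5).astype(int)
loss = binary_cross_entropy(keras_output, target)

print("predicted labels:", predicted_labels)
print("binary cross-entropy:", loss)
```

Running the same two checks on the Gluon output for this input would show how far apart the two models are, both in predicted labels and in loss.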

What is the proper way to translate the Keras/TensorFlow multilabel model to MXNet Gluon?

What’s the LR in the Keras version?

I’ve tried your code with LR = 0.01… it helps but the training’s still way slower than what you see in Keras: 70% acc after more than 2000 epochs.

@spanev thanks for experimenting with the code example. The Keras version uses the exact same learning rate as the Gluon example: 0.001. (0.001 is the default learning rate for Adam in Keras, so it is not shown explicitly in the Keras code example.) The model, data, loss function, and learning rate are all identical between the Gluon and Keras code examples. Yet the Gluon version doesn’t seem to learn in any reasonable way, which is confusing. I’m using Gluon v1.5.0, by the way.
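One way to test the claim that the two models are identical is to compare the shapes each pooling layer produces. The NumPy sketch below mimics average pooling over the two candidate axes of an embedding output shaped (batch, sequence length, embedding size). It assumes that Keras' `GlobalAveragePooling1D` averages over the sequence axis (its default channels-last format) and that Gluon's `GlobalAvgPool1D` averages over the last axis (its default `NCW` layout). Both assumptions should be checked against the installed versions, and the random tensor is only a stand-in for real embeddings.

```python
import numpy as np

BATCH, MAXLEN, EMBED_DIM = 1, 7, 50

# Stand-in for the embedding layer output: (batch, sequence, embedding).
rng = np.random.default_rng(0)
embedded = rng.normal(size=(BATCH, MAXLEN, EMBED_DIM))

# Assumed Keras behaviour: average over the sequence axis,
# leaving one averaged embedding vector per example.
pooled_over_steps = embedded.mean(axis=1)

# Assumed Gluon NCW behaviour: average over the last axis,
# leaving one scalar per token position (Gluon itself would
# keep a trailing axis of size 1, dropped here for brevity).
pooled_over_last = embedded.mean(axis=2)

print(pooled_over_steps.shape)  # (1, 50)
print(pooled_over_last.shape)   # (1, 7)
```

If these assumptions hold, the Gluon `Dense` layer sees 7 features (one averaged scalar per token position) instead of a 50-dimensional averaged embedding, so the two networks would not be equivalent. Transposing the embedding output before pooling, or averaging over axis 1 explicitly, would be one way to align them.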

Also, it’s not just this toy dataset that is causing the problem. I’ve tried this comparison on a larger, real-world multi-label classification problem from Kaggle (the toxic comments competition) and am seeing the same issue. Keras/TensorFlow learns the task well, reaching high training accuracy and a high ROC-AUC score on the validation set, while Gluon isn’t able to learn the task at all.

Any further insights from you or anyone else would be appreciated.