Help with SSD SmoothL1 metric reporting NaN during training

Greetings everyone,

I apologize in advance for any inconvenience as this is my first post.

I am trying to train an SSD model from GluonCV on a custom dataset, created as an LST/RecordIO file.

I am referencing:
[1] https://gluon.mxnet.io/chapter08_computer-vision/object-detection.html
[2] https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html#sphx-glr-build-examples-detection-finetune-detection-py

I am encountering an issue whereby the SmoothL1 metric used in [2] reports NaN, and my model
is unable to detect my target object in a preliminary test.

To diagnose the issue, I tried printing out the anchor boxes generated by this snippet of code in [2]:

def get_dataloader(net, train_dataset, data_shape, batch_size, num_workers):
    from gluoncv.data.batchify import Tuple, Stack, Pad
    from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
    width, height = data_shape, data_shape
    # use fake data to generate fixed anchors for target generation
    with autograd.train_mode():
        _, _, anchors = net(mx.nd.zeros((1, 3, height, width)))
    batchify_fn = Tuple(Stack(), Stack(), Stack())  # stack image, cls_targets, box_targets
    train_loader = gluon.data.DataLoader(
        train_dataset.transform(SSDDefaultTrainTransform(width, height, anchors)),
        batch_size, True, batchify_fn=batchify_fn, last_batch='rollover',
        num_workers=num_workers)
    return train_loader

train_data = get_dataloader(net, dataset, 512, 16, 0)

The anchors contained NaN in the last coordinate of some bounding boxes, e.g.:

[48.437504 29.437502 10.88711 nan]
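For anyone trying to reproduce this, a quick way to quantify the damage is to count the NaNs per coordinate. This is just a sketch: `nan_report` is a hypothetical helper, and it assumes the anchors have been converted to an (N, 4) NumPy array (e.g. via `anchors.asnumpy()`):

```python
import numpy as np

def nan_report(anchors):
    """Count NaNs in an (N, 4) array of anchor boxes.

    Returns (number_of_boxes_with_nan, nan_count_per_coordinate).
    """
    a = np.asarray(anchors, dtype=np.float64).reshape(-1, 4)
    rows = int(np.isnan(a).any(axis=1).sum())
    per_coord = np.isnan(a).sum(axis=0).tolist()
    return rows, per_coord

# example with one corrupted box, mimicking the output above
boxes = [[48.4, 29.4, 10.9, float("nan")],
         [10.0, 10.0, 20.0, 20.0]]
print(nan_report(boxes))  # -> (1, [0, 0, 0, 1])
```

If the NaNs always land in the same coordinate, that points at one specific term in the anchor/target computation rather than random corruption.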

Is anyone able to advise on a way to resolve this NaN issue?

As an interim solution, I am looking to generate the anchor boxes in the manner described in [1]. However, that approach lacks the OHEMSampler used by the SSDTargetGenerator inside SSDDefaultTrainTransform, which I am concerned might affect my model's performance.

Set multi_precision=True in your optimizer.
I don't know why it helps, but it did work for me.

Hi @Neutron, thanks for your reply!

Unfortunately, despite setting multi_precision=True in my optimizer by modifying [2] as:

trainer = gluon.Trainer(
    net.collect_params(),
    'sgd',
    {'learning_rate': 0.001, 'multi_precision': True, 'wd': 0.0005, 'momentum': 0.9})

I was not able to resolve the issue.

Maybe using a small scale to initialize your parameters would help:

net.initialize(mx.init.Uniform(scale=0.01), ctx=ctx)

It is worth trying different settings for your net.
For me, just using


trainer = mx.gluon.Trainer(
    params=net.collect_params(opt_str),
    optimizer='nadam',
    optimizer_params={'beta1':0.9, 'beta2':0.99, 'epsilon':1e-09, 'schedule_decay':0.004,'multi_precision':True})

and

net.initialize(mx.init.Xavier(), ctx=ctx)
net.collect_params('.*bias').initialize(mx.init.LSTMBias(forget_bias=1.0), ctx)  # for LSTM only; for other biases, use mx.init.Zero()

the model fits very well

Hi @Neutron, thanks for your reply!

It seems that my dataset had issues with its ground-truth labels, which was the cause of the problem.

However, I will take note of your parameter-initialization suggestion should I run into further issues
with NaN values.
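In case it helps anyone landing here with the same symptom: a simple scan over the ground-truth labels would have caught my problem early. Here is a sketch, assuming each dataset item yields an (image, label) pair where label rows have the form [xmin, ymin, xmax, ymax, class_id] (`find_bad_labels` is a hypothetical helper, not part of GluonCV):

```python
import numpy as np

def find_bad_labels(dataset):
    """Return indices of samples whose ground-truth boxes contain NaN
    or are degenerate (xmax <= xmin or ymax <= ymin)."""
    bad = []
    for i in range(len(dataset)):
        _, label = dataset[i]
        # assumed label layout: [xmin, ymin, xmax, ymax, class_id]
        boxes = np.asarray(label, dtype=np.float64).reshape(-1, 5)[:, :4]
        if (np.isnan(boxes).any()
                or (boxes[:, 2] <= boxes[:, 0]).any()
                or (boxes[:, 3] <= boxes[:, 1]).any()):
            bad.append(i)
    return bad
```

Any index it returns is worth inspecting in the original .lst annotations before retraining.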

Thank you very much for your help, and for responding nonetheless! :)