Siamese network issue, No gradient, No clue why

Hello, I’ve been trying to reproduce this experiment using MXNet:
https://github.com/adambielski/siamese-triplet/blob/master/Experiments_MNIST.ipynb
The classification part works without problems, but the Siamese part is not converging as expected.

First things first: my dataset is based on the MNIST dataset available in Gluon.

class SiameseNetworkDataset(datasets.MNIST):
    def __init__(self, root, train, transform=None):
        super().__init__(root,train, transform=transform)
        self.root = root
        self.transform = transform
        
        self._data = self._data.transpose((0, 3, 1,2)).astype('float32')/255
        self._label_indexes = {
            i : np.where(self._label==i)[0] for i in range(10)
        }
        
    def __getitem__(self, index):
        items_with_index = list(enumerate(self._label))
        img0_index, img0_tuple = random.choice(items_with_index)
        # we need to make sure approx 50% of images are in the same class
        should_get_same_class = random.randint(0, 1)
        if should_get_same_class:
            img1_index = random.choice(self._label_indexes[img0_tuple])
            img1_tuple = self._label[img1_index]
                
        else:
            img1_index, img1_tuple = random.choice(items_with_index)

        img0 = self._data[img0_index]
        img1 = self._data[img1_index]
        
        
        # Target is 0 for a same-class pair and 1 for a different-class pair
        return img0, img1, mx.nd.array([int(img1_tuple != img0_tuple)])

    def __len__(self):
        return super().__len__()

It does what I want and returns three outputs: two of shape (batch_size, 1, 28, 28) and one of shape (batch_size, 1). So I doubt the error comes from here.

Then there is the model. I’m using the same one as in the example mentioned above.

class SiameseNetwork(nn.HybridBlock):
    def __init__(self):
        super(SiameseNetwork, self).__init__()
        with self.name_scope():
            self.CNN = nn.HybridSequential()
            self.CNN.add(nn.Conv2D(channels=32, kernel_size=5),
                         nn.PReLU(),
                         nn.MaxPool2D(pool_size=2, strides=2),
                         nn.Conv2D(channels=64, kernel_size=5),
                         nn.PReLU(),
                         nn.MaxPool2D(pool_size=2, strides=2),
                         nn.Flatten(),
                         nn.Dense(256),
                         nn.PReLU(),
                         nn.Dense(256),
                         nn.PReLU(),
                         nn.Dense(2))
           
        
    def hybrid_forward(self, F, input1, input2):
        
        output1 = self.CNN(input1)
        output2 = self.CNN(input2)

        return output1, output2

Originally the CNN was another custom HybridBlock, but right now it’s this (I thought I might have an issue because of nested models, but apparently not).

Then there is the loss function

class ContrastiveLoss(Loss):
    def __init__(self, margin=2.0, weight=None, batch_axis=0, **kwargs):
        super(ContrastiveLoss, self).__init__(weight, batch_axis, **kwargs)
        self.margin = margin

    def hybrid_forward(self, F, output1, output2, label):
        euclidean_distance = F.sum(F.square(output1-output2),axis=1)
        loss_contrastive = F.mean((1-label) * F.square(euclidean_distance) +
                                   label * F.square(F.sqrt(F.clip((self.margin - euclidean_distance), 0.0, 10))))
        return loss_contrastive

I tried many things here, using different ndarray functions (norm, relu, sum, etc.); it was basically always the same loss function written in a different way. I checked the shape at every step to make sure that my vector operations were doing the right thing (not producing a matrix of shape (batch_size, batch_size), for instance).
It might be one of the culprits, but if it is, I may not be familiar enough with MXNet to see it.
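For comparison, here is a minimal sketch of the contrastive loss in its usual form, (1 - y) * d^2 + y * max(margin - d, 0)^2, where d is the Euclidean distance and y is 1 for a different-class pair (the convention of the dataset above). It is plain Python so that it runs without MXNet. The function names are hypothetical, and `contrastive_loss_as_posted` is only my reading of the expression in hybrid_forward: there, the variable named euclidean_distance holds the squared distance (no square root is taken), so the same-class term becomes d^4 and the different-class term becomes clip(margin - d^2, 0, 10). Whether that difference explains the lack of convergence has not been verified.

```python
import math


def _squared_distance(a, b):
    # Sum of squared coordinate differences, i.e. d^2 (not d).
    return sum((x - z) ** 2 for x, z in zip(a, b))


def contrastive_loss_reference(emb1, emb2, labels, margin=2.0):
    """Usual form: label 0 = same class, label 1 = different class."""
    total = 0.0
    for a, b, y in zip(emb1, emb2, labels):
        d = math.sqrt(_squared_distance(a, b))
        total += (1 - y) * d ** 2 + y * max(margin - d, 0.0) ** 2
    return total / len(labels)


def contrastive_loss_as_posted(emb1, emb2, labels, margin=2.0):
    """Mirrors the posted expression term by term (hypothetical re-implementation)."""
    total = 0.0
    for a, b, y in zip(emb1, emb2, labels):
        d2 = _squared_distance(a, b)               # F.sum(F.square(o1 - o2), axis=1)
        clipped = min(max(margin - d2, 0.0), 10.0)  # F.clip(margin - d2, 0.0, 10)
        total += (1 - y) * d2 ** 2 + y * math.sqrt(clipped) ** 2
    return total / len(labels)


# Two toy pairs: one same-class pair at distance 2, one different-class pair at distance 0.5.
emb1 = [(0.0, 0.0), (0.0, 0.0)]
emb2 = [(2.0, 0.0), (0.5, 0.0)]
labels = [0, 1]

reference_value = contrastive_loss_reference(emb1, emb2, labels)
posted_value = contrastive_loss_as_posted(emb1, emb2, labels)
print(reference_value, posted_value)
```

The two functions agree only when every distance is exactly 0 or 1, so comparing them on a few hand-made pairs is a quick way to see which quantity is actually being minimised.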

Finally the training loop is

for epoch in range(0, Config.train_number_epochs):
        if (epoch+1)%5 == 0:
            lr *= 0.1
            print(lr)
            trainer.set_learning_rate(lr)
        for i, data in enumerate(train_dataloader, 0):
               
            with autograd.record():
                img0, img1, label = data
                img0 = img0.copyto(mx.gpu())
                img1 = img1.copyto(mx.gpu())

                label = label.squeeze().copyto(mx.gpu())
                output1, output2 = net(img0, img1)
                
   
                loss_contrastive = loss(output1, output2, label)
                loss_contrastive.backward()
                
        
                if mx.nd.contrib.isnan(loss_contrastive):
                    print(loss_contrastive)
                    return label, output1, output2
            
            trainer.step(256)
            
            if i % 20 == 0:
                print("Epoch number {}  : Current loss = {}".format(epoch, loss_contrastive.mean().asscalar()))

I tried opening the autograd scope both before and after copying the batch to the GPU, with no difference. I would think best practice is to open it after the copy, since that operation does not need to be recorded.
label is squeezed because the shape (64, 1) is problematic in the loss function.
I checked the gradients recorded in the model with net.CNN[0].weight.grad and they were non-zero.
However, when I tried to check the loss gradient w.r.t. output1 and output2, I got an error saying that they were not in a computational graph.
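One detail of the loop that may be worth checking: according to the Gluon documentation, Trainer.step(batch_size) normalises the gradient by 1/batch_size, and the suggestion there is to pass 1 when the loss has already been averaged manually. The loss above already takes F.mean over the batch, and trainer.step(256) is then called, so the update would be 256 times smaller than the learning rate suggests. The toy below only illustrates that arithmetic in plain Python. `sgd_update`, the one-weight model, and the data are all hypothetical, and whether this explains the missing convergence is untested.

```python
def grad_of_sum(w, xs, ys):
    # Gradient of sum_i (w * x_i - y_i)^2 with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


def grad_of_mean(w, xs, ys):
    # Gradient of the batch-mean loss: the summed gradient divided by the batch size.
    return grad_of_sum(w, xs, ys) / len(xs)


def sgd_update(w, grad, lr, batch_size):
    # Assumption: mimics plain SGD with the 1/batch_size rescaling of Trainer.step.
    return w - lr * grad / batch_size


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w0, lr, n = 0.0, 0.1, len(xs)

# Per-sample losses summed in backward(), normalised once by step(batch_size).
delta_sum = sgd_update(w0, grad_of_sum(w0, xs, ys), lr, n) - w0
# Loss averaged manually, step(1): the same update as above.
delta_mean = sgd_update(w0, grad_of_mean(w0, xs, ys), lr, 1) - w0
# Loss averaged manually AND step(batch_size): normalised twice.
delta_mean_then_divide = sgd_update(w0, grad_of_mean(w0, xs, ys), lr, n) - w0

print(delta_sum, delta_mean, delta_mean_then_divide)
```

If this applies, either returning the per-sample loss (no F.mean) with step(batch_size), or keeping the mean with step(1), would give an update of the intended size.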

I’m open to any suggestions on why it is not training (and on the dataset as well: the data loading is slow as hell).

I will try transfer learning from the classification model to the metric-learning one. Maybe it is just stuck in a local minimum, but I really doubt it, because when I plot the embeddings they form a single point cloud without any structure.

Thanks for reading. This is the first time I’ve posted on a technical forum, so if something in this post is not in good form, please do tell me as well :)

I figured out the slow dataloader, and also found that not all of my dataset was being used…
I turned this

def __getitem__(self, index):
        items_with_index = list(enumerate(self._label))
        img0_index, img0_tuple = random.choice(items_with_index)

into this

def __getitem__(self, index):
        img0_tuple = self._label[index]

and stored items_with_index as a member, so it is not recomputed every time for
img1_index, img1_tuple = random.choice(self._items_with_index)
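Put together, the change described above can be sketched as follows. This is a stand-alone illustration, not the actual dataset class: `PairIndexSampler` is a hypothetical name, it works on a plain list of labels rather than on the MNIST base class, and it returns the two indices instead of the images so that it runs without MXNet.

```python
import random
from collections import defaultdict


class PairIndexSampler:
    """Hypothetical stand-in for the pair-selection logic of the dataset."""

    def __init__(self, labels, seed=None):
        self._label = list(labels)
        self._rng = random.Random(seed)
        # Built once here instead of on every __getitem__ call.
        self._label_indexes = defaultdict(list)
        for i, lab in enumerate(self._label):
            self._label_indexes[lab].append(i)

    def __len__(self):
        return len(self._label)

    def __getitem__(self, index):
        # The first image is the one requested by the DataLoader,
        # so every sample is visited once per epoch.
        img0_label = self._label[index]
        if self._rng.randint(0, 1):
            # Force a partner from the same class.
            img1_index = self._rng.choice(self._label_indexes[img0_label])
        else:
            # Any sample; it can still land in the same class by chance.
            img1_index = self._rng.randrange(len(self._label))
        img1_label = self._label[img1_index]
        # Target: 0 = same class, 1 = different class.
        return index, img1_index, int(img1_label != img0_label)


toy_labels = [0, 1, 0, 1, 2, 2]
sampler = PairIndexSampler(toy_labels, seed=0)
pairs = [sampler[i] for i in range(len(sampler))]
print(pairs)
```

Because the "any sample" branch can also return a same-class partner, slightly more than half of the pairs end up with target 0, as in the original version.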

But it is still not converging: the embedding plot always looks the same even though the loss goes down…

I found this while going through the docs, but it doesn’t seem to work on MNIST…