Problem when hybridizing with sparse dot


I’m using MXNet 1.2.0 and have a HybridBlock model in which I perform:

F.dot(adj_mat, vec)

where adj_mat is a big 2D CSRNDArray, and vec is a 2D NDArray.

When I run model.forward without hybridizing, my entire training loop runs fine (run the model, compute the loss, run backward, take a step, repeat). But when I hybridize my model and try to train, I get an error that contains the line:

sparse dot does not support computing the gradient of the csr/lhs


  1. Why does this only happen when I hybridize my model?
  2. Is this error happening because the model is trying to compute the gradient of adj_mat? If so, how can I tell the model not to compute this gradient, since I don’t need it?


Hi, could you post a minimum reproducible example?
Does it still happen if you install the latest nightly version (pip install mxnet --pre)?
This test seems to pass:


Updating to the nightly fixed the error - thanks! But unfortunately I think I’m still doing something wrong, since hybridizing doesn’t seem to speed up the network as much as the examples suggest it should.

Here’s a minimal reproducible example:

import time
import mxnet as mx
from mxnet import nd, gluon

class GGNN(gluon.HybridBlock):
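    def __init__(self, hidden_size, **kwargs):
        super(GGNN, self).__init__(**kwargs)
        self.hidden_size = hidden_size

        with self.name_scope():
            self.message_fxns = []
            for t in range(10):
                layer = gluon.nn.Dense(self.hidden_size, in_units=self.hidden_size)
                # Blocks kept in a plain list must be registered explicitly
                self.register_child(layer)
                self.message_fxns.append(layer)
            self.hidden_gru = gluon.rnn.GRUCell(self.hidden_size, input_size=self.hidden_size)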

    def compute_messages(self, F, values, edges):
        summed_msgs = []
        for adj_mat, msg_fxn in zip(edges, self.message_fxns):
            passed_msgs = msg_fxn(values)
            summed_msgs.append(F.dot(adj_mat, passed_msgs))
        values = F.sum(F.stack(*summed_msgs), axis=0)
        return values

    def update_values(self, F, values, messages):
        values, _ = self.hidden_gru(messages, [values])
        return values

    def hybrid_forward(self, F, values, *args, **kwargs):
        edges = args[0]
        for t in range(8):
            messages = self.compute_messages(F, values, edges)
            values = self.update_values(F, values, messages)
        return values

def time_model(model, ctx):
    tic = time.time()
    for b in range(10):
        values = nd.random.normal(shape=(10000, hidden_size), ctx=ctx)
        edges = [nd.random.normal(shape=(10000, 10000), ctx=ctx) for _ in range(10)]
        model(values, edges)
    return time.time() - tic

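if __name__=='__main__':
    hidden_size = 64
    ctx = mx.gpu(0)
    model = GGNN(hidden_size)
    model.initialize(ctx=ctx)  # allocate parameters on the target device
    print('Without hybridize: {}'.format(time_model(model, ctx)))
    model.hybridize()  # switch to the cached symbolic graph before the second run
    print('With hybridize: {}'.format(time_model(model, ctx)))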

This returns
“Without hybridize: 3.265143871307373
With hybridize: 3.2073848247528076”
for me.

Is it possible that the mxnet scheduler is somehow not performing the for loop in compute_messages in parallel?
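
A note on the measurement itself, which may matter before drawing conclusions about the scheduler: MXNet queues operators asynchronously, so a timer that stops right after the last model call can end up measuring mostly how long it took to enqueue the work. The timed loop above also generates fresh random inputs on every iteration, which adds the same cost to both runs. Below is a minimal sketch of a timing helper that takes a synchronization hook; the names time_calls, warmup and sync are hypothetical and are not part of MXNet.

```python
import time

def time_calls(fn, n_iters=10, warmup=1, sync=None):
    """Return the wall-clock seconds taken by n_iters calls of fn.

    sync is an optional zero-argument callable that blocks until all
    queued work has finished (an assumption about the backend in use).
    """
    for _ in range(warmup):
        fn()      # untimed calls, e.g. so a hybridized block can build its graph
    if sync is not None:
        sync()    # drain pending work so it is not charged to the timed loop
    tic = time.perf_counter()
    for _ in range(n_iters):
        fn()
    if sync is not None:
        sync()    # wait until the timed calls have actually finished
    return time.perf_counter() - tic
```

With the script above, this could be called as time_calls(lambda: model(values, edges), sync=mx.nd.waitall), with values and edges created once beforehand so that input generation stays outside the timed region. Whether the gap between the two runs changes under this measurement is something to check rather than assume.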