Are there any side effects of using ‘Sequential()’ or ‘HybridSequential()’ only as a container?


I am reading a tutorial about MXNet. The authors use ‘mxnet.gluon.nn.Sequential()’ as a container to store some blocks (see code 1), and then define how the blocks are connected themselves in ‘def forward(self, x)’ (see codes 2 and 3). Are there any side effects of doing this? Also, what is the difference between ‘Sequential()’ and ‘HybridSequential()’? I tried replacing the ‘Sequential’ with a Python list and got the following warning during initialization.

“‘ToySSD.downsamplers’ is a container with Blocks. Note that Blocks inside the list, tuple or dict will not be registered automatically. Make sure to register them using register_child() or switching to nn.Sequential/nn.HybridSequential instead.”

As far as I know, putting blocks in ‘mxnet.gluon.nn.Sequential()’ or ‘mxnet.gluon.nn.HybridSequential()’ tells the framework that these blocks are connected in sequence. However, if you define the relationship between the blocks in the ‘forward’ function, you are telling it to connect them in another way. Will this lead to confusion? If ‘forward’ only connects some of the blocks, what happens to the other blocks in ‘Sequential()’ that ‘forward’ does not use?

The entire tutorial can be found here.

code 1:

def toy_ssd_model(num_anchors, num_classes):
    downsamplers = nn.Sequential()
    for _ in range(3):

    class_predictors = nn.Sequential()
    box_predictors = nn.Sequential()
    for _ in range(5):
        class_predictors.add(class_predictor(num_anchors, num_classes))

    model = nn.Sequential()
    model.add(body(), downsamplers, class_predictors, box_predictors)
    return model

code 2:

def toy_ssd_forward(x, model, sizes, ratios, verbose=False):
    body, downsamplers, class_predictors, box_predictors = model
    anchors, class_preds, box_preds = [], [], []
    # feature extraction
    x = body(x)
    for i in range(5):
        # predict
            x, sizes=sizes[i], ratios=ratios[i]))
        if verbose:
            print('Predict scale', i, x.shape, 'with',
                  anchors[-1].shape[1], 'anchors')
        # down sample
        if i < 3:
            x = downsamplers[i](x)
        elif i == 3:
            x = nd.Pooling(
                x, global_pool=True, pool_type='max',
                kernel=(x.shape[2], x.shape[3]))
    # concat data
    return (concat_predictions(anchors),

code 3:

from mxnet import gluon
class ToySSD(gluon.Block):
    def __init__(self, num_classes, verbose=False, **kwargs):
        super(ToySSD, self).__init__(**kwargs)
        # anchor box sizes and ratios for 5 feature scales
        self.sizes = [[.2,.272], [.37,.447], [.54,.619],
                      [.71,.79], [.88,.961]]
        self.ratios = [[1,2,.5]]*5
        self.num_classes = num_classes
        self.verbose = verbose
        num_anchors = len(self.sizes[0]) + len(self.ratios[0]) - 1
        # use name_scope to guard the names
        with self.name_scope():
            self.model = toy_ssd_model(num_anchors, num_classes)

    def forward(self, x):
        anchors, class_preds, box_preds = toy_ssd_forward(
            x, self.model, self.sizes, self.ratios,
        # it is better to have class predictions reshaped for softmax computation
        class_preds = class_preds.reshape(shape=(0, -1, self.num_classes+1))
        return anchors, class_preds, box_preds


In Gluon, networks are built using Blocks. If something is not a Block, it cannot be part of a Gluon network. A Dense layer is a Block, a convolution is a Block, a pooling layer is a Block, etc.

Sometimes you might want a Block that is not a pre-defined block in Gluon but is a sequence of predefined Gluon blocks. For example,

Conv2D -> MaxPool2D -> Conv2D -> MaxPool2D -> Flatten -> Dense -> Dense

Gluon doesn’t have a predefined block that performs the above sequence of operations, but it does have Blocks that perform each individual operation. So you can create your own block that performs the whole sequence by stringing together predefined Gluon blocks. Example:

net = gluon.nn.HybridSequential()

with net.name_scope():

    # First convolution
    net.add(gluon.nn.Conv2D(channels=20, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))

    # Second convolution
    net.add(gluon.nn.Conv2D(channels=50, kernel_size=5, activation='relu'))
    net.add(gluon.nn.MaxPool2D(pool_size=2, strides=2))

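    # Flatten the output before the fully connected layers
    net.add(gluon.nn.Flatten())

    # First fully connected layer with 512 neurons
    net.add(gluon.nn.Dense(512, activation="relu"))

    # Second fully connected layer with as many neurons as the number of classes
    # (num_classes is a placeholder name, assumed to be defined elsewhere)
    net.add(gluon.nn.Dense(num_classes))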

When you create a sequence like that, you can either use HybridSequential or Sequential. To understand the difference, you need to understand the difference between symbolic and imperative programming.

  • HybridBlock is a Block that can be converted into a symbolic graph for faster execution. HybridSequential is a sequence of HybridBlocks.
  • A plain Block (not a hybrid one) cannot be converted into a symbolic graph. Sequential is a sequence of Blocks that do not have to be hybrid.
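
To make the symbolic/imperative distinction above a bit more concrete, here is a toy sketch in plain Python. It is not MXNet's API: the Sym class and the var helper are hypothetical names made up for illustration. The imperative function computes values line by line, while the symbolic version first builds a small expression graph and only computes when eval is called.

```python
# Toy illustration only: plain Python, not MXNet's API.

# Imperative style: each line runs immediately and yields a concrete value.
def imperative(a, b):
    c = a + b        # computed right now
    return c * b     # computed right now


# Symbolic style: first describe the computation as a graph, then run it.
class Sym:
    """A node in a tiny expression graph (hypothetical helper)."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __add__(self, other):
        return Sym("add", (self, other))

    def __mul__(self, other):
        return Sym("mul", (self, other))

    def eval(self, **values):
        # Leaves look up their value; inner nodes combine their inputs.
        if self.op == "var":
            return values[self.name]
        left, right = (node.eval(**values) for node in self.inputs)
        return left + right if self.op == "add" else left * right


def var(name):
    return Sym("var", name=name)


# Nothing is computed here; we only build the graph (a + b) * b.
graph = (var("a") + var("b")) * var("b")

print(imperative(1, 2))        # 6
print(graph.eval(a=1, b=2))    # 6
```

Because the whole graph exists before any data flows through it, a framework has the chance to optimize it as a unit. That is roughly the benefit a HybridBlock is after when it is converted into a symbolic graph.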

Whether or not a block is hybrid depends on how it is implemented. Almost all predefined Gluon blocks are also HybridBlocks. Sometimes there is a reason why a block cannot be hybrid; Tree LSTM is one example. More often, a block is not hybrid simply because whoever wrote it didn’t put in the effort to make it so (e.g., making it hybrid wouldn’t give a big performance boost, or it is hard to make the block hybrid).

Note that Sequential and HybridSequential are not just containers like a Python list. When you use one of them, you are actually creating a new Block out of preexisting blocks, and the blocks you add are registered as its children. This is why you cannot simply replace Sequential with a Python list.
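
Here is a rough sketch of why a list behaves differently, written as a toy model in plain Python. It is not Gluon's real code: the Block, Dense and Sequential classes below are simplified stand-ins, and only the general idea (children have to be registered for the parent to find their parameters) is taken from the warning message quoted in the question.

```python
# Toy model of child registration; plain Python, not Gluon's real code.

class Block:
    def __init__(self):
        # Use object.__setattr__ so these attributes bypass the hook below.
        object.__setattr__(self, "children", [])
        object.__setattr__(self, "params", [])

    def __setattr__(self, name, value):
        # Assigning a Block as an attribute registers it as a child.
        if isinstance(value, Block):
            self.children.append(value)
        object.__setattr__(self, name, value)

    def collect_params(self):
        # Walk registered children only; unregistered blocks are invisible.
        found = list(self.params)
        for child in self.children:
            found.extend(child.collect_params())
        return found


class Dense(Block):
    def __init__(self, name):
        super().__init__()
        self.params.append(name + "_weight")


class Sequential(Block):
    def add(self, *blocks):
        # add() registers every block explicitly.
        self.children.extend(blocks)


class WithList(Block):
    def __init__(self):
        super().__init__()
        self.layers = [Dense("a"), Dense("b")]   # a list is not a Block


class WithSequential(Block):
    def __init__(self):
        super().__init__()
        self.layers = Sequential()
        self.layers.add(Dense("a"), Dense("b"))


print(WithList().collect_params())        # []
print(WithSequential().collect_params())  # ['a_weight', 'b_weight']
```

In this toy version the blocks stored in a plain list are never seen by collect_params, so nothing would initialize or train their parameters. That is the situation the warning is trying to steer you away from.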

Okay, so you know how to create your own block by stringing together preexisting blocks. Good. What if you want to do more than pass the data through a sequence of blocks? What if you want to conditionally pass the data through one of those blocks? Here is an example from ResNet:

class BasicBlockV1(HybridBlock):
    def __init__(self, channels, stride, downsample=False, in_channels=0, **kwargs):
        super(BasicBlockV1, self).__init__(**kwargs)
        self.body = nn.HybridSequential(prefix='')
        self.body.add(_conv3x3(channels, stride, in_channels))
        self.body.add(_conv3x3(channels, 1, channels))
        if downsample:
            self.downsample = nn.HybridSequential(prefix='')
            self.downsample.add(nn.Conv2D(channels, kernel_size=1, strides=stride,
                                          use_bias=False, in_channels=in_channels))
        else:
            self.downsample = None

    def hybrid_forward(self, F, x):
        residual = x

        x = self.body(x)

        if self.downsample:
            residual = self.downsample(residual)

        x = F.Activation(residual+x, act_type='relu')

        return x

This code creates a new Block using preexisting Gluon blocks, but it does more than just run the data through them in sequence. Given some data, the block always runs it through the body block. It runs the data through downsample only if this Block was created with downsample set to True. It then adds the output of body to the residual (the downsampled input, or the unchanged input when there is no downsample) and applies a ReLU to produce the output. As you can see, there is more happening than just passing data through a sequence of Blocks. This is when you create your own block by subclassing HybridBlock or Block.

Note that the __init__ function creates the necessary blocks, and the forward function takes the input and runs it through the blocks created in __init__. forward does not modify those blocks; it only runs the data through them.

Similarly, in the example you quoted, the first code block creates blocks like downsamplers, class_predictors, box_predictors. The forward functions in code block 2 and 3 do not modify those blocks. They merely pass the input data through those blocks.
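
A small sketch may help with the "will it lead to confusion?" part of the question. This is plain Python with a made-up Sequential class and lambda "layers", not Gluon itself. The point it illustrates is that the sequential chaining only happens when the container itself is called. If a custom forward indexes into the container and calls the stored blocks one by one, the container's own chaining is never used.

```python
# Toy sketch in plain Python; not Gluon's real classes.

class Sequential:
    def __init__(self):
        self.blocks = []

    def add(self, *blocks):
        self.blocks.extend(blocks)

    def __getitem__(self, i):
        return self.blocks[i]

    def __call__(self, x):
        # The chaining happens only here, when the container itself is called.
        for block in self.blocks:
            x = block(x)
        return x


stages = Sequential()
stages.add(lambda x: x + 1, lambda x: x * 10, lambda x: x - 3)

# Calling the container chains all three stages: ((2 + 1) * 10) - 3
chained = stages(2)


# A custom forward ignores that chaining and picks stages individually.
def custom_forward(x):
    outputs = []
    for i in range(2):          # the third stage is never called
        outputs.append(stages[i](x))
    return outputs


print(chained)            # 27
print(custom_forward(2))  # [3, 20]
```

In this sketch the third stage is still stored in the container, but it takes no part in custom_forward. The same reasoning applies to the tutorial code: toy_ssd_forward unpacks the model and calls the stored blocks individually, so the order in which they were added does not dictate how they are connected.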


Thanks for the introduction to symbolic and imperative programming. I infer that ‘mxnet’ is trying to combine the advantages of symbolic programming and imperative programming. Is that where the name ‘mxnet’ comes from?

It is exciting that I can pull out a specific layer by indexing when the network is built with ‘Sequential’. A tutorial on the ‘mxnet’ home page shows a similar example. But a block with a custom forward function does not support indexing: net[0] gives “‘ToySSD’ object does not support indexing”. Am I misunderstanding something?

Can I retrieve the intermediate output of a specific layer, or its weights, by using indexing?


Yes. Here is a quote from the MXNet paper:

Our combined new effort resulted in MXNet (or “mix-net”), intending to blend advantages of different approaches. Declarative programming offers clear boundary on the global computation graph, discovering more optimization opportunity, whereas imperative programs offers more flexibility.

Intermediate output is hard to access because intermediate outputs might not be saved for optimization purposes. Is there any specific reason you want to do this?

Weights can be accessed using the collect_params function.

For example,
net.collect_params()['resnetv11_conv0_weight'].data() will get the weights of the first convolution layer of ResNet.


One way to access the output of a specific layer is to tap into the original network with ‘SymbolBlock’. The documentation of ‘SymbolBlock’ contains an example, and a more complicated example is ‘gluoncv.nn.feature.FeatureExpander’, which is used to build a Single Shot MultiBox Detector. This approach is useful but not very straightforward. It would be better if there were a graphical interface for manipulating these blocks, similar to Simulink.


Another reason is that MXNet handles the derivatives automatically. In MATLAB, Maple or Mathematica, a symbolic expression is fixed once we have written it down. In MXNet, however, the symbolic expression is built while the computation is performed. When you put something in ‘mxnet.gluon.nn.Sequential()’ or ‘mxnet.gluon.nn.HybridSequential()’, MXNet does not know how to calculate the derivative until you show it how to perform the forward computation. The documentation contains an example.
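
To illustrate the idea that the derivative only becomes available once a forward computation has been performed, here is a toy reverse-mode autodiff sketch in plain Python. It is not MXNet's autograd. The Value class is a hypothetical helper that records, during the forward pass, which inputs each result came from, and then walks that record backwards.

```python
# Toy reverse-mode autodiff in plain Python; not MXNet's autograd.

class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each parent entry is (input Value, local derivative w.r.t. it).
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Order nodes so each one is processed after everything that uses it.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


w = Value(3.0)
x = Value(2.0)
b = Value(1.0)

# Before this line runs, there is no graph and nothing to differentiate.
y = w * x + b
y.backward()

print(w.grad, x.grad, b.grad)  # 2.0 3.0 1.0
```

The graph used by backward is exactly the one traced by the forward expression, which matches the observation above: storing blocks in a container does not define any derivative, running data through them does.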

By the way, the documentation is chaotic. Mathematica’s documentation is an excellent example: you can run the examples inside the documentation itself. MATLAB’s is another: you can search it with imprecise search terms. With MXNet, however, I have to read the source code to understand how a function is used.