GPU memory in training vs bind/load


Hello everybody,

I have a model that takes up about 2.4GB of GPU memory while being served.

However, when training, the memory requirement rises to more than 8GB.

I found that defining the model, initializing the Module() and binding it (via bind(for_training=True), so I understand the executors already exist at that point) takes up some 4GB. Then, when I actually call the .fit() method, roughly another 4GB is reserved.

My problem is that, with this amount of memory, I need a GPU with roughly four times as much memory for training as for serving the model. This severely limits scalability, and it is already borderline on the many GPUs that have 11GB.

Any hints about this? Why does training take up so much GPU memory? Is there a way around this?

Thanks in advance!


Here’s a good article talking about what’s taking up the memory. In the meantime, are you able to provide more details? What batch size are you using? What architecture? Can you use a simpler architecture or reduce the batch size?

Observation: Who occupies the memory?

We can divide the data in GPU memory into 4 categories according to their functionalities:
– Model Parameters (Weights)
– Feature Maps
– Gradient Maps
– Workspace



Thanks for your reply, Vishaal, and also thanks for the interesting link, which gave me a very good idea of what’s eating up memory. Very helpful!

I believe offloading and then prefetching again would have to happen outside the scope of the MXNet APIs, right?

Regarding the specific model, it’s actually a very simple one. Here’s the core of it (it’s a recommender system):

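    import mxnet as mx

    # max_users, max_items and embed_size are defined elsewhere.
    # Inputs: user ids, item ids and the target score.
    user = mx.symbol.Variable("user")
    item = mx.symbol.Variable("item")
    score = mx.symbol.Variable("softmax_label")

    # One embedding table per entity type.
    user_embed = mx.symbol.Embedding(name="user_embed", data=user,
                                     input_dim=max_users, output_dim=embed_size)
    item_embed = mx.symbol.Embedding(name="item_embed", data=item,
                                     input_dim=max_items, output_dim=embed_size)

    # L2-normalize both embeddings (note: rebinds the names user and item).
    user = mx.symbol.L2Normalization(user_embed)
    item = mx.symbol.L2Normalization(item_embed)

    # Element-wise product summed over the embedding axis, i.e. a dot product.
    dot = user * item
    dot = mx.symbol.sum(dot, axis=1)
    dot = mx.symbol.Flatten(dot)

    # Regress the dot product against the score.
    pred = mx.symbol.LinearRegressionOutput(data=dot, label=score)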

The purpose of keeping the model so very simple is basically to keep the embeddings with a high “semantic” load. Also, embed_size is fixed to 64 (according to experimentation, it can’t go much lower).

The problem, I believe, is not really the topology itself but the number of embeddings: with some ~7M items and ~200k users (max_items and max_users, respectively), the embedding matrices grow very large (the model itself uses about 2.5GB). I could use sparse embeddings, but those would save me less than 10% of the total memory currently being used (and 10% means, in this case, that I would run into the memory problem again in a few months, as the user/item base grows).
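As a rough sanity check on where the memory goes, the parameter footprint can be estimated from the table sizes alone. The sketch below assumes float32 weights, dense gradients, and an optimizer that keeps one extra buffer per weight (for example momentum). The row counts are the approximate figures quoted above, and the helper name `embedding_bytes` is made up for illustration. It ignores the CUDA context, workspace and allocator overhead, so real usage will be higher.

```python
BYTES_FP32 = 4  # assumption: weights are stored as float32


def embedding_bytes(rows, embed_size, bytes_per_value=BYTES_FP32):
    """Size in bytes of one dense embedding table (hypothetical helper)."""
    return rows * embed_size * bytes_per_value


# Approximate figures from the post (~200k users, ~7M items, 64 dims).
max_users = 200_000
max_items = 7_000_000
embed_size = 64

weights = (embedding_bytes(max_users, embed_size)
           + embedding_bytes(max_items, embed_size))
gradients = weights        # assumption: one dense gradient per weight
optimizer_state = weights  # assumption: one extra buffer per weight

total = weights + gradients + optimizer_state

for name, size in [("weights", weights),
                   ("weights + gradients", weights + gradients),
                   ("weights + gradients + optimizer state", total)]:
    print(f"{name}: {size / 1e9:.2f} GB")
```

Under those assumptions the weights alone come to about 1.8 GB, and every additional full-size copy (gradients, optimizer state) adds the same amount again. That would be consistent with memory growing in steps, first at bind and then at fit, although it is only one possible reading of the numbers.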

I’m currently using a fairly large batch size (50k), but some tests revealed that reducing the batch size helps only to a very limited extent (again, in the 10% range).
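A quick estimate suggests why the batch size matters so little here. The sketch below assumes five float32 intermediate arrays of shape (batch_size, embed_size), namely the two embedding lookups, the two normalized outputs and their product, plus one gradient array for each. That count is an assumption based on the symbol graph above, not a measurement.

```python
BYTES_FP32 = 4  # assumption: float32 activations and weights

batch_size = 50_000
embed_size = 64

# Assumption: 5 intermediate (batch_size, embed_size) arrays, each with a gradient.
n_intermediate = 5
per_array = batch_size * embed_size * BYTES_FP32
activation_bytes = 2 * n_intermediate * per_array

# Embedding tables for ~200k users and ~7M items (approximate figures).
table_bytes = (200_000 + 7_000_000) * embed_size * BYTES_FP32

ratio = activation_bytes / table_bytes
print(f"batch-dependent arrays: {activation_bytes / 1e6:.0f} MB")
print(f"embedding tables:       {table_bytes / 1e6:.0f} MB")
print(f"ratio:                  {ratio:.1%}")
```

If the assumed array count is roughly right, the batch-dependent memory is a small fraction of a single copy of the embedding tables, so shrinking the batch can only recover a correspondingly small share.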

I have been exploring mixed precision training, but so far no luck… In fact, if I use the multi_precision=True flag in the optimizer, it actually seems to use up more memory (!). I’m still exploring this, however.
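One possible explanation for the multi_precision observation: according to the MXNet optimizer documentation, multi_precision=True makes the optimizer keep an internal 32-bit copy of the weights. The sketch below simply counts bytes per parameter. It assumes SGD with a single momentum buffer, and that the momentum is stored in the same precision as the copy of the weights the optimizer updates. These are assumptions, not measurements of this model.

```python
FP16, FP32 = 2, 4  # bytes per value

# Per-parameter storage under each setup (assumed layouts, SGD with momentum).
layouts = {
    "fp32 weights": {
        "weight": FP32, "grad": FP32, "momentum": FP32,
    },
    "fp16 weights, multi_precision=False": {
        "weight": FP16, "grad": FP16, "momentum": FP16,
    },
    "fp16 weights, multi_precision=True": {
        "weight": FP16, "grad": FP16,
        "fp32 master copy": FP32, "momentum": FP32,
    },
}

totals = {name: sum(parts.values()) for name, parts in layouts.items()}

for name, total in totals.items():
    print(f"{name}: {total} bytes per parameter")
```

If these assumptions hold, float16 weights with a float32 master copy cost as many bytes per parameter as plain float32, so a model dominated by its embedding tables would see little or no saving from the flag.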

Thanks for your message!