I have an interesting (maybe?) use case. I don’t think it’s been covered in previous threads, but apologies if this is a duplicate.

To make the problem concrete, let me first describe a model. Suppose I have the following softmax model with feature function F and weights W; given x, the probability of label y is:

```
h(x, y) = exp( F(x,y) * W ) / ( sum_{y'} exp( F(x,y') * W ) )
```

Note that F returns features of both the input x and the output y. A simple example is when y is a one-hot encoding of the label and F(x, y) returns the Kronecker product of x and y. In this case, the model is equivalent to a regular logit model. But you can imagine cases where F returns more complex features, for example when y has its own structure or attributes.
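To make the equivalence concrete, here is a minimal NumPy sketch (NumPy rather than MXNet, purely for illustration). The helper `kron_features` and the sizes are hypothetical. It uses `kron(y, x)`, which differs from `kron(x, y)` only by a permutation of the feature indices.

```python
import numpy as np

def kron_features(x, label, num_labels):
    """F(x, y) for a one-hot y: the Kronecker product of y and x."""
    y = np.zeros(num_labels)
    y[label] = 1.0
    return np.kron(y, x)  # x copied into block `label`, zeros elsewhere

rng = np.random.default_rng(0)
num_labels, num_features = 3, 4
x = rng.normal(size=num_features)
W = rng.normal(size=num_labels * num_features)

# Scores from the joint feature function, one per label.
scores_joint = np.array([kron_features(x, k, num_labels) @ W
                         for k in range(num_labels)])
# Scores from an ordinary logit model with a (num_labels, num_features) weight matrix.
scores_logit = W.reshape(num_labels, num_features) @ x
```

The two score vectors agree, so applying a softmax to either gives the same probabilities.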

The key complication is this: **for any given input, the eligible labels may change.** For example, maybe certain labels are incompatible with certain inputs. With Y(x) denoting the eligible labels for x, the output is defined as:

```
h(x, y) = exp( F(x,y) * W ) / ( sum_{y' in Y(x)} exp( F(x,y') * W ) )
```

where y is assumed to be in Y(x). Note that the partition function (sum in the denominator) changed.
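For a single input, the restricted partition function can be sketched as follows. This is plain NumPy with made-up scores and a made-up eligible set, only to pin down the definition.

```python
import numpy as np

def restricted_softmax(scores, eligible):
    """Softmax normalized over only the eligible label indices Y(x)."""
    s = scores[eligible]
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])  # hypothetical F(x, y) * W for 4 labels
eligible = np.array([0, 2])                # hypothetical Y(x)
probs = restricted_softmax(scores, eligible)  # one probability per eligible label
```

The result has one entry per label in Y(x) and sums to 1 over that set only.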

**My question to the forum is: what is the best, most efficient way to implement this in MXNet?**

We can assume that the maximum number of eligible labels is bounded, which means that we can just treat this as the original model with appropriate padding/masking. Accordingly, the features can be precomputed and stored in a matrix:

```
Fxy = nd.zeros((batch_size, max_num_labels, num_features))  # fill with precomputed F(x, y); unused label slots stay zero
```

Maybe sparse data structures would help if the eligible labels are sparse. Anyway, what is the **right** way to mask the unused features? Is there a built-in MXNet feature? Or is zero padding the best way?
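One point worth noting: zero padding alone does not remove a padded slot from the model. With no bias term, a zero feature vector scores 0 and still contributes exp(0) = 1 to the denominator, so an explicit mask has to travel with the padded array. A minimal NumPy sketch of building both, with a hypothetical helper `pad_features`:

```python
import numpy as np

def pad_features(feature_list, max_num_labels):
    """Stack per-example (num_eligible, num_features) arrays into a padded batch plus a mask."""
    batch_size = len(feature_list)
    num_features = feature_list[0].shape[1]
    Fxy = np.zeros((batch_size, max_num_labels, num_features), dtype=np.float32)
    mask = np.zeros((batch_size, max_num_labels), dtype=bool)
    for i, f in enumerate(feature_list):
        n = f.shape[0]
        Fxy[i, :n] = f     # real features go in the first n slots
        mask[i, :n] = True  # True marks an eligible label
    return Fxy, mask

# Hypothetical batch with 2, 4, and 1 eligible labels and 3 features each.
feature_list = [np.ones((2, 3)), np.ones((4, 3)), np.ones((1, 3))]
Fxy, mask = pad_features(feature_list, max_num_labels=4)
```

Both arrays have fixed shapes, so they can be converted to NDArrays and batched as usual.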

Given Fxy, predicting the preactivations is simple:

```
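net = Dense(1, use_bias=False, flatten=False)  # one score F(x,y) * W per (x, y) pair
net.initialize()
preact = net(Fxy).reshape((0, -1))  # (batch_size, max_num_labels)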
```

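Masking is needed when computing the softmax. Mathematically, we can set the preactivation to -inf wherever the label is unavailable, so that its exponential is zero. (Multiplying by -inf would not do this: it flips the sign of negative scores and gives NaN for a score of zero.) This seems to work in MXNet; e.g.,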

```
>>> nd.softmax(nd.array([1, 1, -np.inf]))
[0.5 0.5 0. ]
<NDArray 3 @cpu(0)>
```

Is there a built-in mask for the softmax function? If not, it would be nice to have.
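In the absence of a confirmed built-in, the batched version of the -inf trick can be sketched like this. It is written in NumPy to keep it self-contained; `masked_softmax` is a hypothetical helper, not an MXNet function.

```python
import numpy as np

def masked_softmax(preact, mask):
    """Row-wise softmax over entries where mask is True; masked entries get probability 0."""
    masked = np.where(mask, preact, -np.inf)
    # Subtract the row max for stability. This assumes every row has at least
    # one eligible label; an all-masked row would produce NaN.
    masked = masked - masked.max(axis=1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) == 0
    return e / e.sum(axis=1, keepdims=True)

preact = np.array([[1.0, 1.0, 5.0],
                   [0.0, 2.0, 1.0]])
mask = np.array([[True, True, False],
                 [True, True, True]])
probs = masked_softmax(preact, mask)
```

The first row reproduces the example above: the masked third label gets probability 0 and the other two split the mass evenly.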

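The problem with padding/masking is that it’s inefficient: when most inputs have far fewer eligible labels than the maximum, memory and compute are spent on padded slots. It would make more sense to store the features as a list of 2-D arrays: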

```
Fx = [nd.zeros((num_eligible_labels[i], num_features)) for i in range(batch_size)]  # one 2-D array per example
```

As far as I know, this can’t be converted to a single NDArray because the arrays have different shapes, so it can’t be loaded onto the GPU as one batch, and one can’t do batch prediction. But is anything *like* this possible?
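One layout that is *like* this, sketched in NumPy under stated assumptions: pack every eligible (x, y) row into one flat 2-D array and keep an integer vector recording which example each row belongs to. The softmax is then computed per segment. The helper `segment_softmax` and the sizes are hypothetical, and whether MXNet offers efficient segment-wise operators for this is not something this sketch verifies.

```python
import numpy as np

def segment_softmax(scores, segment_ids, num_segments):
    """Softmax within each segment of a flat score vector."""
    seg_max = np.full(num_segments, -np.inf)
    np.maximum.at(seg_max, segment_ids, scores)  # per-example max for stability
    e = np.exp(scores - seg_max[segment_ids])
    seg_sum = np.bincount(segment_ids, weights=e, minlength=num_segments)
    return e / seg_sum[segment_ids]

# Hypothetical batch: 2, 3, and 1 eligible labels, with 4 features each.
num_eligible = np.array([2, 3, 1])
rng = np.random.default_rng(0)
F_flat = rng.normal(size=(num_eligible.sum(), 4))  # (total_eligible, num_features)
W = rng.normal(size=4)
# Example index for every row of F_flat: [0, 0, 1, 1, 1, 2].
segment_ids = np.repeat(np.arange(len(num_eligible)), num_eligible)

probs = segment_softmax(F_flat @ W, segment_ids, len(num_eligible))
```

No padded slots are stored, and the flat array has a fixed rank, so it could in principle live in a single device array. The cost is the extra index bookkeeping.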

== Ben