The link does not work; it returns an error:

403 Forbidden

```
Code: AccessDenied
Message: Access Denied
```

It should work now! Thanks for pointing it out!

I think there is a mismatch between the *Multilayer Perceptron Attention* definition and the implementation:

$\exists x, y \in \mathbb{R}: \tanh(x+y) \neq \tanh(x) + \tanh(y)$

The reason I say this is that the code effectively adds tanh(x) and tanh(y) together, rather than summing them first and then applying tanh.
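
For what it's worth, here is a minimal plain-NumPy sketch (not the book's code; `W_q`, `W_k`, `v` are just random parameters for illustration) contrasting the definition, which sums the projections *before* applying tanh, with summing two separately tanh-ed projections:

```
import numpy as np

d_q, d_k, h = 4, 6, 8            # query dim, key dim, hidden units (arbitrary)
W_q = np.random.randn(h, d_q)    # hypothetical query projection
W_k = np.random.randn(h, d_k)    # hypothetical key projection
v = np.random.randn(h)           # hypothetical scoring vector
q, k = np.random.randn(d_q), np.random.randn(d_k)

# Definition: a(q, k) = v^T tanh(W_q q + W_k k) -- sum first, then tanh
score_def = v @ np.tanh(W_q @ q + W_k @ k)

# What tanh-ing each projection separately and then summing gives instead
score_sep = v @ (np.tanh(W_q @ q) + np.tanh(W_k @ k))

print(score_def, score_sep)      # generally different numbers
```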

Please let me know if my understanding is wrong!

Is there a problem with the notation in the `MLPAttention` code?

The code is listed as

```
query, key = self.W_k(query), self.W_q(key)
```

But the math notation tells me it should be

```
query, key = self.W_k(key), self.W_q(query)
```

It won't break anything, but for clarity's sake: is there anything I am missing?

Can someone please explain why, in the code below, `valid_len` is set equal to the `batch_size`? From the `masked_softmax` code, I can understand the role of `valid_len`, but it is not clear why it should be equal to the `batch_size`. An example would be great.

```
#@save
class DotProductAttention(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # valid_len: either (batch_size, ) or (batch_size, xx)
    def forward(self, query, key, value, valid_len=None):
        d = query.shape[-1]
        # Set transpose_b=True to swap the last two dimensions of key
        scores = npx.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_len))
        return npx.batch_dot(attention_weights, value)
```

@vahuja4, recall that a **valid length** is the length of a sequence without appended padding tokens. In our code, `valid_len` is not a single number but rather an *array* (or even a *matrix*) of *shape* `(batch_size,)`. So, all the comments describe the *shapes* of the formal parameters of `forward`, not their *values*. In particular, `valid_len` must be of *shape* `(batch_size,)`, i.e. for every entry in a batch, specify a valid length; e.g. `batch_size = 4`, `valid_len = np.array([3, 5, 5, 2])`.
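
To make that concrete, here is an illustrative run (assuming the chapter's `DotProductAttention` and `masked_softmax` definitions above, plus the usual `from mxnet import np, npx` imports; the shapes are arbitrary):

```
from mxnet import np, npx
npx.set_np()

batch_size, num_queries, num_kv, d, dim_v = 4, 1, 6, 2, 3
query = np.ones((batch_size, num_queries, d))
key = np.ones((batch_size, num_kv, d))
value = np.arange(batch_size * num_kv * dim_v).reshape(batch_size, num_kv, dim_v)

# One valid length per example in the batch, hence shape (batch_size,)
valid_len = np.array([3, 5, 5, 2])

atten = DotProductAttention(dropout=0)
atten.initialize()
out = atten(query, key, value, valid_len)
print(out.shape)  # (4, 1, 3): one dim_v-sized output per query per example
```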

What about the case when `valid_len` is a matrix? I guess it is expected to be of shape `(batch_size, #queries)`.
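
If so, a small check with `masked_softmax` would look like this (again assuming the chapter's definition and imports):

```
from mxnet import np, npx
npx.set_np()

batch_size, num_queries, num_kv = 2, 3, 4
scores = np.random.uniform(size=(batch_size, num_queries, num_kv))

# Shape (batch_size, #queries): one valid length per query of each example
valid_len = np.array([[1, 2, 3],
                      [2, 2, 4]])

print(masked_softmax(scores, valid_len))
# For example i, query j only the first valid_len[i, j] weights are non-zero
```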

The PyTorch version of MLPAttention is missing the tanh operator on the sum.

Also, is there any reason why the bias term is not added in `MLPAttention`?