Attention

https://d2l.ai/chapter_attention-mechanisms/attention.html

The link does not work; it returns an error:

403 Forbidden

Code: AccessDenied
Message: Access Denied

@TristonC, I think that link is now this:

http://d2l.ai/chapter_attention-mechanisms/index.html


It should work now! Thanks for pointing it out!

I think there is a problem between the multilayer perceptron (MLP) attention definition and the implementation:

\exists\, x, y \in \mathbb{R}: \tanh(x+y) \neq \tanh(x) + \tanh(y)

The reason I say this is that the implementation effectively adds \tanh(x) and \tanh(y) together, rather than first adding x and y and then applying \tanh to the sum.

Please let me know if my understanding is wrong!
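
A quick numerical check (plain Python) makes the gap obvious:

import math

x, y = 1.0, 1.0
print(math.tanh(x + y))             # 0.9640...
print(math.tanh(x) + math.tanh(y))  # 1.5232..., so the two differ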

Is there a problem with notation inside the code of MLPAttention?
The code is listed as
query, key = self.W_k(query), self.W_q(key)
But the math notation tells me it should be
query, key = self.W_k(key), self.W_q(query)
It would not break anything, but for clarity's sake: is there anything I am missing?
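
For reference, here is a minimal sketch of additive (MLP) attention that follows the math notation, with W_q applied to the query, W_k applied to the key, and tanh applied after the sum. It assumes the chapter's masked_softmax helper is in scope; the class name and the units parameter (the hidden size of the scoring MLP) are mine:

from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

class MLPAttentionSketch(nn.Block):
    def __init__(self, units, dropout, **kwargs):
        super(MLPAttentionSketch, self).__init__(**kwargs)
        self.W_q = nn.Dense(units, use_bias=False, flatten=False)  # projects queries
        self.W_k = nn.Dense(units, use_bias=False, flatten=False)  # projects keys
        self.v = nn.Dense(1, use_bias=False, flatten=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, valid_len=None):
        query, key = self.W_q(query), self.W_k(key)
        # Broadcast to (batch_size, #queries, #kv_pairs, units) and sum
        # BEFORE applying tanh, as the definition requires.
        features = np.expand_dims(query, axis=2) + np.expand_dims(key, axis=1)
        scores = np.squeeze(self.v(np.tanh(features)), axis=-1)
        attention_weights = self.dropout(masked_softmax(scores, valid_len))
        return npx.batch_dot(attention_weights, value)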


Can someone please explain why, in the code below, valid_len is set equal to the batch_size? From the masked_softmax code, I can understand the role of valid_len, but it is not clear why it should be equal to the batch_size. An example would be great.

import math
from mxnet import npx
from mxnet.gluon import nn

#@save
class DotProductAttention(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # valid_len: either (batch_size, ) or (batch_size, xx)
    def forward(self, query, key, value, valid_len=None):
        d = query.shape[-1]
        # Set transpose_b=True to swap the last two dimensions of key
        scores = npx.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
        # masked_softmax is defined earlier in the chapter
        attention_weights = self.dropout(masked_softmax(scores, valid_len))
        return npx.batch_dot(attention_weights, value)

@vahuja4, recall that a valid length is the length of a sequence without appended padding tokens. In our code, valid_len is not a single number but an array (or even a matrix). The comments describe the shapes of the formal parameters of forward, not their values. In the 1-D case, valid_len must be of shape (batch_size,), i.e. it specifies a valid length for every entry in the batch; e.g. batch_size = 4, valid_len = np.array([3, 5, 5, 2]).
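
To make this concrete, here is a minimal usage sketch, assuming the DotProductAttention class quoted above and the chapter's masked_softmax are in scope (the values are arbitrary):

from mxnet import np, npx
npx.set_np()

atten = DotProductAttention(dropout=0.5)
atten.initialize()
keys = np.ones((2, 10, 2))                                  # (batch_size=2, #kv_pairs=10, d=2)
values = np.arange(40).reshape(1, 10, 4).repeat(2, axis=0)  # (2, 10, dim_v=4)
# valid_len has shape (batch_size,): example 0 attends only to the
# first 2 key-value pairs, example 1 to the first 6.
atten(np.ones((2, 1, 2)), keys, values, np.array([2, 6]))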

What about the case when valid_len is a matrix? I guess it is expected to be of shape (batch_size, #queries).
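
If that is right, each query would get its own valid length. Continuing the sketch above with hypothetical values:

# 2-D case: batch_size=2 and 2 queries per example; query j of example i
# attends to the first valid_len[i, j] key-value pairs.
valid_len = np.array([[2, 6], [3, 5]])
atten(np.ones((2, 2, 2)), keys, values, valid_len)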

The PyTorch version of MLP attention is missing the tanh operator on the sum.

Also, is there any reason why the bias term is not added in MLPAttention?