The link doesn't work; it fails with this error:
403 Forbidden
Code: AccessDenied
Message: Access Denied
It should work now! Thanks for pointing it out!
I think there is a mismatch between the Multilayer Perceptron Attention definition and the implementation:
\exists x,y \in \mathbb{R}: \tanh(x+y) \neq \tanh(x) + \tanh(y)
I say this because the implementation seems to apply tanh to each term separately and then sum the results, i.e. it computes tanh(x) + tanh(y), rather than adding the terms first and then applying tanh to the sum.
Please let me know if my understanding is wrong!
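A quick numerical check of the inequality above may help. This is only a sketch using the standard library; the function names tanh_of_sum and sum_of_tanh are made up for illustration and do not come from the book's code.

```python
import math

def tanh_of_sum(x, y):
    # Add first, then squash: the result always lies in (-1, 1).
    return math.tanh(x + y)

def sum_of_tanh(x, y):
    # Squash each term, then add: the result can lie anywhere in (-2, 2).
    return math.tanh(x) + math.tanh(y)

x, y = 1.0, 1.0
print(tanh_of_sum(x, y))
print(sum_of_tanh(x, y))
```

For x = y = 1 the second value is larger than 1, which tanh of any real number can never be, so the two expressions cannot be interchanged.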
Is there a problem with notation inside the code of MLPAttention?
The code is listed as
query, key = self.W_k(query), self.W_q(key)
But the math notation tells me it should be
query, key = self.W_k(key), self.W_q(query)
Not that it will break anything, but for clarity’s sake. Is there anything I am missing?
Can someone please explain why, in the code below, valid_len is set equal to the batch_size? From the masked_softmax code, I can understand the role of valid_len, but it is not clear why it should be equal to the batch_size. An example would be great.
#@save
class DotProductAttention(nn.Block):
    def __init__(self, dropout, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
        self.dropout = nn.Dropout(dropout)

    # query: (batch_size, #queries, d)
    # key: (batch_size, #kv_pairs, d)
    # value: (batch_size, #kv_pairs, dim_v)
    # valid_len: either (batch_size, ) or (batch_size, xx)
    def forward(self, query, key, value, valid_len=None):
        d = query.shape[-1]
        # Set transpose_b=True to swap the last two dimensions of key
        scores = npx.batch_dot(query, key, transpose_b=True) / math.sqrt(d)
        attention_weights = self.dropout(masked_softmax(scores, valid_len))
        return npx.batch_dot(attention_weights, value)
@vahuja4, recall that a valid length is the length of a sequence without the appended padding tokens. In our code, valid_len is not a single number but rather an array (or even a matrix) whose first dimension is batch_size. So, all the comments describe the shapes of the formal parameters of forward, not their values. In particular, in the array case valid_len must be of shape (batch_size,), i.e. it specifies one valid length for every entry in a batch; e.g. with batch_size = 4, valid_len = np.array([3, 5, 5, 2]).
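To make the shape-versus-value distinction concrete, here is a small pure-Python sketch. It is not the book's masked_softmax; the name masked_softmax_sketch and the use of nested lists are assumptions made for illustration. It masks every position at or beyond the valid length with a large negative score before the softmax.

```python
import math

def masked_softmax_sketch(scores, valid_len):
    """Hypothetical stand-in for masked_softmax (illustration only).

    scores: nested list of shape (batch_size, num_queries, num_kv_pairs).
    valid_len: list of batch_size ints, one valid length per batch entry.
    """
    out = []
    for batch_scores, n_valid in zip(scores, valid_len):
        batch_out = []
        for row in batch_scores:
            # Positions >= n_valid are padding: give them a very low score.
            masked = [s if j < n_valid else -1e6 for j, s in enumerate(row)]
            m = max(masked)
            exps = [math.exp(s - m) for s in masked]
            total = sum(exps)
            batch_out.append([e / total for e in exps])
        out.append(batch_out)
    return out

batch_size, num_queries, num_kv_pairs = 4, 1, 5
# All-zero scores, so the unmasked positions share the weight equally.
scores = [[[0.0] * num_kv_pairs for _ in range(num_queries)]
          for _ in range(batch_size)]
valid_len = [3, 5, 5, 2]  # length batch_size: one entry per sequence
weights = masked_softmax_sketch(scores, valid_len)
for n_valid, w in zip(valid_len, weights):
    print(n_valid, [round(p, 3) for p in w[0]])
```

Here valid_len has batch_size entries, but its values (3, 5, 5, 2) are sequence lengths; the first sequence spreads its attention over 3 positions and the last over 2.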
What about the case when valid_len is a matrix? I guess it is expected to be of shape (batch_size, #queries).
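Assuming that guess is right, a matrix-shaped valid_len would give each query its own valid length. The sketch below illustrates that reading in pure Python; the helper names softmax_prefix and masked_softmax_2d_sketch are hypothetical and this is not the book's implementation.

```python
import math

def softmax_prefix(row, n_valid):
    # Softmax over one row of scores, ignoring positions >= n_valid.
    masked = [s if j < n_valid else -1e6 for j, s in enumerate(row)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax_2d_sketch(scores, valid_len):
    """scores: (batch_size, num_queries, num_kv_pairs) nested list.
    valid_len: (batch_size, num_queries) nested list of ints (assumed shape).
    """
    return [
        [softmax_prefix(row, n) for row, n in zip(batch_scores, batch_len)]
        for batch_scores, batch_len in zip(scores, valid_len)
    ]

# batch_size = 2, num_queries = 2, num_kv_pairs = 4, all scores zero.
scores_2d = [[[0.0] * 4 for _ in range(2)] for _ in range(2)]
valid_len_2d = [[1, 3], [2, 4]]
weights_2d = masked_softmax_2d_sketch(scores_2d, valid_len_2d)
for batch_w in weights_2d:
    for row in batch_w:
        print([round(p, 3) for p in row])
```

Under this assumption, the first query of the first sequence attends to a single key, while the second query of the same sequence attends to three.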
The PyTorch version of MLP attention is missing the tanh operator on the sum.
Also, is there any reason why the bias term is not added in MLPAttention?
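For reference, here is a minimal sketch of how the additive score would look with tanh applied to the sum, assuming the usual form score(q, k) = v^T tanh(W_q q + W_k k) and no bias terms. The helpers matvec and additive_score and the toy weights are made up for illustration; this is not the book's MLPAttention.

```python
import math

def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_score(q, k, W_q, W_k, v):
    # Project the query with W_q and the key with W_k (no bias, by assumption).
    hq = matvec(W_q, q)
    hk = matvec(W_k, k)
    # Add the projections first, then apply tanh to the sum.
    hidden = [math.tanh(a + b) for a, b in zip(hq, hk)]
    # Reduce to a scalar score with the vector v.
    return sum(vi * hi for vi, hi in zip(v, hidden))

# Toy parameters: identity projections and an all-ones v.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
score = additive_score([1.0, 0.0], [1.0, 0.0], W_q, W_k, v)
print(score)
```

With these toy values the hidden vector is [tanh(2), tanh(0)], so the score equals tanh(2). Adding a bias would only shift the argument of tanh; whether the book omits it deliberately is not something this sketch can answer.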