I found this implementation somewhat inconsistent with the text description.
- Should we create num_heads copies of W_q, W_k, W_v? Or, equivalently, use hidden_size*num_heads as the output dimension of the dense layers W_q, W_k, W_v?
- Should we have num_heads copies of the attention layer instead of one?
Yeah. I also think the implementation of multi-head attention is wrong here, and what you suggest is correct.
I think the implementation has the same effect as the text description.
1. For W_q, W_k, W_v, the implementation here uses one dense layer with hidden_size*num_heads units, while the paper uses num_heads copies of a dense layer with hidden_size units each. The i-th small dense layer is equivalent to units [(i-1)*hidden_size : i*hidden_size] of the big dense layer, so backpropagation has the same effect whether you train one big layer or num_heads small ones.
2. For the attention layers, the Transformer uses dot-product attention, which has no parameters to train; it just computes matrix multiplications and dot products of its inputs. Using one copy or several makes no difference.
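To make point 1 concrete, here is a minimal NumPy sketch (assuming no bias terms, and with made-up sizes) showing that one wide projection is identical to num_heads per-head projections whose weights are the corresponding slices of the big weight matrix; it also shows that dot-product attention itself introduces no trainable parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_heads, seq_len = 4, 2, 5
x = rng.standard_normal((seq_len, hidden_size))

# One big projection with hidden_size * num_heads output units
# (weight shape: hidden_size x hidden_size*num_heads, no bias)...
W_big = rng.standard_normal((hidden_size, hidden_size * num_heads))
out_big = x @ W_big

# ...equals num_heads small projections built from column slices of W_big.
out_small = np.concatenate(
    [x @ W_big[:, i * hidden_size:(i + 1) * hidden_size]
     for i in range(num_heads)],
    axis=-1,
)
assert np.allclose(out_big, out_small)

# Dot-product attention is parameter-free: only matmuls and a softmax.
def dot_product_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Since the gradient of each small layer's weights depends only on its own output slice, the two parameterizations also receive identical gradients during training.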
@gold_piggy the link does not work. It shows the following error:

```
403 Forbidden
Code: AccessDenied
Message: Access Denied
RequestId: 1FC5C3A1976FBE23
HostId: I8mK3i9om5k4II980EvBAYXCdYigJcQgCr32AVziVpRDxsGu6kf8yVaikbMnU9F3dzYeAzWiib0=

An Error Occurred While Attempting to Retrieve a Custom Error Document
Code: AccessDenied
Message: Access Denied
```