Transformer

https://d2l.ai/chapter_attention-mechanism/transformer.html

I found this implementation somewhat inconsistent with the description in the text, and I have two questions about the multi-head attention.

  1. Should we create num_heads copies of W_q, W_k, and W_v? Or, equivalently, use hidden_size * num_heads as the number of output units (the first parameter) of the dense layers W_q, W_k, and W_v? (See the sketch after this list.)
  2. Should we have num_heads copies of the attention layer, instead of just one?
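
For concreteness, here is a minimal sketch of what the code seems to be doing, as I understand it: a single W_q/W_k/W_v with num_hiddens output units whose result is reshaped into heads, and one attention computation applied to all heads as an enlarged batch. This is PyTorch, and the names (SingleProjectionMHA, num_hiddens, num_heads) are placeholders of my own, not the book's API. My question is whether this is meant to be equivalent to keeping num_heads separate copies of the projections and attention layers.

```python
# Sketch only: one projection per Q/K/V, split into heads by reshaping,
# and one scaled dot-product attention run on all heads as a larger batch.
# Names here (SingleProjectionMHA, num_hiddens, num_heads) are my own.
import torch
from torch import nn


class SingleProjectionMHA(nn.Module):
    def __init__(self, num_hiddens, num_heads):
        super().__init__()
        assert num_hiddens % num_heads == 0
        self.num_heads = num_heads
        # One dense layer per Q/K/V with num_hiddens outputs, rather than
        # num_heads separate layers of num_hiddens / num_heads outputs each.
        self.W_q = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_k = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_v = nn.Linear(num_hiddens, num_hiddens, bias=False)
        self.W_o = nn.Linear(num_hiddens, num_hiddens, bias=False)

    def _split_heads(self, X):
        # (batch, seq, num_hiddens) -> (batch * num_heads, seq, num_hiddens / num_heads).
        # Each head only sees a disjoint slice of the projection's output, so this
        # spans the same family of functions as num_heads independent small projections.
        b, s, h = X.shape
        X = X.reshape(b, s, self.num_heads, h // self.num_heads)
        return X.permute(0, 2, 1, 3).reshape(b * self.num_heads, s, -1)

    def forward(self, queries, keys, values):
        q = self._split_heads(self.W_q(queries))
        k = self._split_heads(self.W_k(keys))
        v = self._split_heads(self.W_v(values))
        d = q.shape[-1]
        # One attention computation over batch * num_heads "examples"; since
        # dot-product attention has no parameters, this appears to match
        # running num_heads separate attention layers.
        scores = torch.bmm(q, k.transpose(1, 2)) / d ** 0.5
        out = torch.bmm(torch.softmax(scores, dim=-1), v)
        # Merge heads back: (batch * num_heads, seq, d) -> (batch, seq, num_hiddens).
        b = queries.shape[0]
        out = out.reshape(b, self.num_heads, -1, d).permute(0, 2, 1, 3)
        return self.W_o(out.reshape(b, -1, self.num_heads * d))


attn = SingleProjectionMHA(num_hiddens=64, num_heads=4)
X = torch.randn(2, 10, 64)
print(attn(X, X, X).shape)  # torch.Size([2, 10, 64])
```

If the two formulations are indeed equivalent, I assume the reshaping version is just a more efficient way to run all heads in parallel; if not, I would like to understand where they differ from the text's description.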