Attention
Attention maps a query and key-value pairs to an output. The model compares the query with each key, then uses the resulting weights to average the values.
We call our particular attention Scaled Dot-Product Attention. Queries and keys have dimension dk; values have dimension dv.
In implementation, the attention scores are produced by multiplying the query matrix by the transposed key matrix, scaling by sqrt(dk), and normalizing with softmax.
scores = torch.matmul(query, key.transpose(-2, -1))
scores = scores / math.sqrt(query.size(-1))
p_attn = scores.softmax(dim=-1)
return torch.matmul(p_attn, value)
Softmax turns the scaled scores into attention weights, so the final output is a weighted mixture of the value vectors.