
# 06-technical-discussion

ben

07/03/2023, 6:39 PM

Hey, all! reading attention and I've gotten stuck, looking for advice:
why do we believe the query and key matrices are learning distinct sets of information?
here are my assumptions:
1. the dataflow looks exactly symmetric between the two, you could theoretically rename query and key with each other's names. so in backprop, they have the same gradient flow. This looks pretty clear from the softmax(QK^T) part
1A. In the attention paper, the fact that during decoder blocks, the queries come from the encoder side and the keys come from the decoder side could make a difference here? But in decoder only transformers I'm not sure this applies
2. if they have identical dataflow & backprop, then information could theoretically propagate to either the decoder or encoder side arbitrarily. So, information about "what x is looking for" and "what x offers" could just as easily get encoded in either matrix.
3. If (2) is true, then why might they not just both learn a mixture of "what x offers" and "what x is looking for" -- and thus not learn distinct sets of information at all? Do we have ablations to prove that the "query" corresponds to our common notion of "what x is looking for"?
Please, I'm very open to flaws in my reasoning, not a researcher lol, just product eng trying to get the grasp of this. any help appreciated 🙏 🙏

👀 1


Ajay Arasanipalai

07/03/2023, 10:20 PM

> the dataflow looks exactly symmetric between the two

Not quite. The computations that produce `Q` and `K` are identical, but the queries and keys are obtained using different sets of projection matrices/weights. So the computation might look like:

```
Q = q_proj @ x
K = (k_proj @ x).T
```

And `q_proj.shape == k_proj.shape` (for bidirectional self-attention), but `q_proj != k_proj`.

> you could theoretically rename query and key with each other's names

As long as they're the same shape, and you're using the standard self-attention (so no multi-query, etc.), yeah, you could. Again, the computation is identical; what makes them different is the projection weights. So you could compute `Q` and `K` like above, then "forget" about that and swap all the uses of `Q` and `K` throughout your code - the model would still work, because this is equivalent to just changing the variable names.
What you can't do is swap `Q` and `K` at inference time.

👍 2
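
A minimal NumPy sketch of what happens if the two projections trade places after the weights are fixed. The toy sizes, the random stand-ins for trained weights, the row-per-token layout, and the helper names `softmax_rows` and `attention_weights` are all assumptions made for illustration. The `1/sqrt(d_head)` scaling is left out to match the `softmax(QK^T)` shorthand used in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 3  # hypothetical toy sizes

x = rng.normal(size=(T, d_model))            # one token per row (assumed layout)
q_proj = rng.normal(size=(d_model, d_head))  # random stand-ins for trained weights
k_proj = rng.normal(size=(d_model, d_head))


def softmax_rows(scores):
    """Softmax along each row, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def attention_weights(x, w_query, w_key):
    """softmax(Q K^T) with Q = x @ w_query and K = x @ w_key (no scaling)."""
    Q = x @ w_query
    K = x @ w_key
    return softmax_rows(Q @ K.T)


trained = attention_weights(x, q_proj, k_proj)

# Swapping the two weight matrices transposes the raw score matrix...
scores = (x @ q_proj) @ (x @ k_proj).T
scores_swapped = (x @ k_proj) @ (x @ q_proj).T
scores_transposed = np.allclose(scores_swapped, scores.T)

# ...and since the softmax runs per row, the attention weights come out different.
swapped = attention_weights(x, k_proj, q_proj)
weights_changed = not np.allclose(swapped, trained)

print(scores_transposed, weights_changed)
```

Renaming variables before training changes nothing, but here the weights are held fixed while their roles are exchanged, so the layers downstream would see attention patterns they were never trained on.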


Leon Wu

07/03/2023, 10:29 PM

also i don't think it's exactly symmetrical, because of the softmax
If you write out the matrix multiply, letting context length = T, Q and K being dimension T x d_head, and letting q_i be row i of Q and k_j be row j of K,
you find that QK^T is a matrix that looks like

```
q_1 * k_1 , q_1 * k_2, q_1 * k_3 ... q_1 * k_T
q_2 * k_1 ...
...
q_T * k_1 ... q_T * k_T
```

and then the softmax is applied on each row of the resulting T x T matrix separately, i.e. softmax across the dot products with each key, holding the query constant (so q and k end up playing different roles)
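
The row-wise normalization can be checked numerically. This is a rough sketch with made-up sizes and random `Q` and `K`, using the same conventions as the message above (row i of `Q` is q_i, row j of `K` is k_j).

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_head = 5, 4  # hypothetical toy sizes

Q = rng.normal(size=(T, d_head))  # row i is q_i
K = rng.normal(size=(T, d_head))  # row j is k_j

scores = Q @ K.T  # entry (i, j) is the dot product q_i . k_j

# Spot-check one entry against the written-out layout (0-based indices here).
i, j = 1, 3
entry_matches = bool(np.isclose(scores[i, j], np.dot(Q[i], K[j])))

# Softmax over each row: hold the query fixed, normalize across all keys.
shifted = scores - scores.max(axis=1, keepdims=True)
exp_scores = np.exp(shifted)
attn = exp_scores / exp_scores.sum(axis=1, keepdims=True)

row_sums = attn.sum(axis=1)  # one probability distribution per query
col_sums = attn.sum(axis=0)  # per-key totals are not constrained to 1

print("row sums:", row_sums)
print("col sums:", col_sums)
```

Every row sums to 1 because each query distributes a fixed budget across the keys. The column sums are unconstrained, so a key can collect a lot of attention or almost none. That is the asymmetry the softmax introduces.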

Ajay Arasanipalai

07/03/2023, 10:50 PM

> why might they not just both learn a mixture of "what x offers" and "what x is looking for"

They could! The whole "what x offers" and "what x is looking for" thing is mostly just an analogy that's useful for building an intuition. For example, at least in the simplest version of attention, there's no explicit loss term or architectural decision that forces `Q` to be more "query-like" and `K` more "key-like" - they're just words to ascribe some meaning to a soup of high-dimensional linear algebra. Even the word "attention" is somewhat anthropomorphic - each row of the attention matrix simply represents a distribution over value vectors.

❤️ 2
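
One way to make the "nothing forces it" point concrete, offered as an illustration rather than something stated in the thread: the scores depend on the two projections only through their product. Mixing the head dimensions of one projection with any invertible matrix `M`, and undoing that mix on the other, leaves `QK^T` unchanged. The sizes, the random weights, and the particular `M` below are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_model, d_head = 4, 8, 3  # hypothetical toy sizes

x = rng.normal(size=(T, d_model))            # one token per row (assumed layout)
q_proj = rng.normal(size=(d_model, d_head))  # random stand-ins for learned weights
k_proj = rng.normal(size=(d_model, d_head))

# An arbitrary invertible mixing matrix (upper triangular, nonzero diagonal).
M = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.5]])

# Mix the query projection with M and compensate on the key side with inv(M)^T.
q_mixed = q_proj @ M
k_mixed = k_proj @ np.linalg.inv(M).T

scores = (x @ q_proj) @ (x @ k_proj).T
scores_mixed = (x @ q_mixed) @ (x @ k_mixed).T

same_scores = np.allclose(scores, scores_mixed)          # QK^T is unchanged
different_weights = not np.allclose(q_proj, q_mixed)     # but the weights are not

print(same_scores, different_weights)
```

So many different pairs of projection matrices give exactly the same attention pattern, which fits the view that "query" and "key" are labels for roles in the computation, not a guarantee about what each matrix encodes.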


ben

07/03/2023, 10:52 PM

oh shoot i mistyped: "information could theoretically propagate to either the decoder or encoder side arbitrarily"
i meant information could theoretically propagate to either the query or key side arbitrarily
