ben
07/03/2023, 6:39 PM

Ajay Arasanipalai
07/03/2023, 10:20 PM
> the dataflow looks exactly symmetric between the two
Not exactly. The operations to materialize Q and K are identical, but the queries and keys are obtained using different sets of projection matrices/weights. So the computation might look like:
Q = q_proj @ x
K = (k_proj @ x).T
And q_proj.shape == k_proj.shape (for bidirectional self-attention), but q_proj != k_proj.
> you could theoretically rename query and key with each other's names
As long as they're the same shape, and you're using standard self-attention (so no multi-query attention, etc.), yes, you could. Again, the computation is identical; what makes them different is the projection weights. So you could initialize Q and K like above, then "forget" about that and swap all the uses of Q and K throughout your code, and the model would still work, because this is equivalent to just changing the variable names.
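A quick numeric check of that renaming claim, and of why a one-sided swap is not the same thing (a sketch with made-up shapes; row-per-token convention here for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_k, T = 8, 4, 5
x = rng.normal(size=(T, d))         # rows are tokens
q_proj = rng.normal(size=(d, d_k))
k_proj = rng.normal(size=(d, d_k))

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Q = x @ q_proj                      # (T, d_k)
K = x @ k_proj                      # (T, d_k)
A = softmax_rows(Q @ K.T / np.sqrt(d_k))

# 1) Swap the names *and* every use - pure renaming, identical numbers.
K2 = x @ q_proj                     # the array formerly known as Q
Q2 = x @ k_proj                     # the array formerly known as K
A_renamed = softmax_rows(K2 @ Q2.T / np.sqrt(d_k))   # uses swapped too
assert np.allclose(A, A_renamed)

# 2) Swap only the *uses* (e.g. at inference, after training the usual way) -
# a different result, because the row-wise softmax is not symmetric in Q and K.
A_swapped = softmax_rows(K @ Q.T / np.sqrt(d_k))
assert not np.allclose(A, A_swapped)
```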
What you can't do, for example, is train the usual way and then swap the uses of Q and K at inference time.

Leon Wu
07/03/2023, 10:29 PM
07/03/2023, 10:29 PMq_1 * k_1 , q_1 * k_2, q_1 * k_3 ... q_1 * k_T
q_2 * k_1 ...
...
q_T * k_1 ... q_T * k_T
and then the softmax is applied to each row of the resulting T x T matrix separately, i.e. softmax across the dot products with each key, holding the query constant (so q and k end up playing different roles)

Ajay Arasanipalai
07/03/2023, 10:50 PM
> why might they not just both learn a mixture of "what x offers" and "what x is looking for"
They could! The whole "what x offers" / "what x is looking for" thing is mostly just an analogy that's useful for building intuition. For example, at least in the simplest version of attention, there's no explicit loss term or architectural decision that forces Q to be more "query-like" and K to be more "key-like" - they're just words to ascribe some meaning to a soup of high-dimensional linear algebra. Even the word "attention" is somewhat anthropomorphic - the attention matrix simply represents a distribution over value vectors.

ben
07/03/2023, 10:52 PM
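To make the row-wise softmax and the "distribution over value vectors" remarks above concrete, a minimal numpy sketch (all shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d_k, d_v = 5, 4, 3
Q = rng.normal(size=(T, d_k))      # rows are q_i
K = rng.normal(size=(T, d_k))      # rows are k_j
V = rng.normal(size=(T, d_v))      # rows are value vectors

scores = Q @ K.T / np.sqrt(d_k)    # scores[i, j] = q_i . k_j / sqrt(d_k)
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
A = np.exp(scores)
A /= A.sum(axis=1, keepdims=True)  # softmax across each row (over the keys)

# Each row of A is a probability distribution over the T positions...
assert np.allclose(A.sum(axis=1), 1.0) and (A > 0).all()

# ...so each output row is a convex combination of the rows of V.
out = A @ V
assert out.shape == (T, d_v)
```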