#06-technical-discussion
ben

07/03/2023, 6:39 PM
Hey, all! reading about attention and I've gotten stuck, looking for advice: why do we believe the query and key matrices are learning distinct sets of information? here are my assumptions:
1. the dataflow looks exactly symmetric between the two, you could theoretically rename query and key with each others names. so in backprop, they have the same gradient flow. This looks pretty clear from the softmax(QK^T) part
1A. In the attention paper, the fact that during decoder blocks, the queries come from the decoder side and the keys come from the encoder side could make a difference here? But in decoder only transformers I'm not sure this applies
2. if they have identical dataflow & backprop, then information could theoretically propagate to either the decoder or encoder side arbitrarily. So, information about "what x is looking for" and "what x offers" could just as easily get encoded in either matrix.
3. If (2) is true, then why might they not just both learn a mixture of "what x offers" and "what x is looking for" -- and thus not learn distinct sets of information at all? Do we have ablations to prove that the "query" corresponds to our common notion of "what x is looking for"?
Please, I'm very open to flaws in my reasoning, not a researcher lol, just product eng trying to get a grasp of this. any help appreciated 🙏 🙏
👀 1
Ajay Arasanipalai

07/03/2023, 10:20 PM
the dataflow looks exactly symmetric between the two
Not exactly. The operations to materialize `Q` and `K` are identical, but the queries and keys are obtained using different sets of projection matrices/weights. So the computation might look like:
Q = q_proj @ x
K = (k_proj @ x).T
And `q_proj.shape == k_proj.shape` (for bidirectional self-attention), but `q_proj != k_proj`.
you could theoretically rename query and key with each others names
As long as they're the same shape, and you're using the standard self-attention (so no multi-query, etc.), yeah, you could. Again, the computation is identical, what makes them different is the projection weights. So you could initialize `Q` and `K` like above, then "forget" about that and swap all the uses of `Q` and `K` throughout your code - the model would still work, because this is equivalent to just changing the variable names. What you can't do, for example, is train the usual way then swap uses of `Q` and `K` at inference time.
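A minimal sketch of why the after-the-fact swap breaks things, in plain Python so it runs without a tensor library. Everything here is an assumption for illustration: the sizes, the random matrices standing in for trained `q_proj` / `k_proj`, the helper names, and the row-vector convention (`x @ w` rather than `q_proj @ x`). The usual 1/sqrt(d_head) scaling is left out to match the softmax(QK^T) shorthand in this thread.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Softmax applied to each row independently."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scores(x, w_query, w_key):
    """Q K^T with Q = x @ w_query and K = x @ w_key."""
    q = matmul(x, w_query)
    k = matmul(x, w_key)
    return matmul(q, transpose(k))

def random_matrix(rows, cols, scale, rng):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

rng = random.Random(0)
T, d_model, d_head = 4, 6, 3                      # made-up sizes
x = random_matrix(T, d_model, 1.0, rng)           # one token embedding per row
q_proj = random_matrix(d_model, d_head, 0.3, rng) # stand-in for trained weights
k_proj = random_matrix(d_model, d_head, 0.3, rng)

trained_scores = scores(x, q_proj, k_proj)
swapped_scores = scores(x, k_proj, q_proj)        # roles swapped after "training"

trained = softmax_rows(trained_scores)
swapped = softmax_rows(swapped_scores)

# Swapping the projections just transposes the score matrix ...
print(max_abs_diff(swapped_scores, transpose(trained_scores)))
# ... but the row-wise softmax of a transposed matrix is a different matrix.
print(max_abs_diff(trained, swapped))
```

The first number should be zero up to rounding and the second clearly nonzero: the swapped model normalizes over what used to be the query axis, so its attention weights no longer match the ones the weights were trained to produce.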
👍 2
Leon Wu

07/03/2023, 10:29 PM
also i don't think it's exactly symmetrical, because of the softmax. If you write out the matrix multiply, letting context length = T, Q and K being dimension T x d_head, and letting q_i be row i of Q and k_j be row j of K, you find that QK^T is a matrix that looks like
q_1 * k_1 , q_1 * k_2, q_1 * k_3 ... q_1 * k_T
q_2 * k_1 ...
...
q_T * k_1 ... q_T * k_T
and then the softmax is applied on each row of the resulting T x T matrix separately, i.e. softmax across the dot products with each key, holding query constant (so q and k end up playing different roles)
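A tiny sketch of that row-wise normalization, assuming a made-up 3 x 3 score matrix (the numbers are arbitrary, not from any model). It shows that each row of the result sums to 1 while the columns generally do not, which is where the asymmetry between the query index and the key index comes from.

```python
import math

def softmax(values):
    """Numerically stable softmax over a single list of scores."""
    mx = max(values)
    exps = [math.exp(v - mx) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: scores[i][j] stands for q_i * k_j.
scores = [
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0],
    [1.0, 1.0, 1.0],
]

# One softmax per row: query i is held fixed and the keys compete.
weights = [softmax(row) for row in scores]

row_sums = [sum(row) for row in weights]
col_sums = [sum(col) for col in zip(*weights)]

print(row_sums)  # each entry is 1 up to rounding
print(col_sums)  # not constrained to be 1
```

So a key can receive a lot of total weight or almost none across queries, but every query always hands out exactly one unit of weight.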
Ajay Arasanipalai

07/03/2023, 10:50 PM
why might they not just both learn a mixture of "what x offers" and "what x is looking for"
They could! The whole "what x offers" and "what x is looking for" thing is mostly just an analogy that's useful for building an intuition. For example, at least in the simplest version of attention, there's no explicit loss term or architectural decision that forces `Q` to be more "query-like" and `K` more "key-like" - they're just words to ascribe some meaning to a soup of high-dimensional linear algebra. Even the word "attention" is somewhat anthropomorphic - each row of the attention matrix simply represents a distribution over value vectors.
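To make the "distribution over value vectors" reading concrete, here is a small sketch with hand-picked numbers. The weights and value vectors are hypothetical; the only assumption carried over from the thread is that each attention row is non-negative and sums to 1.

```python
# Hypothetical attention rows (one per query position); each sums to 1.
weights = [
    [0.7, 0.2, 0.1],
    [0.0, 0.5, 0.5],
]

# Hypothetical value vectors v_1, v_2, v_3 (one per key position).
values = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

def weighted_average(row, vectors):
    """Mix the value vectors using one attention row as the mixing weights."""
    dim = len(vectors[0])
    return [sum(w * v[c] for w, v in zip(row, vectors)) for c in range(dim)]

# Same result as the matrix product weights @ values.
outputs = [weighted_average(row, values) for row in weights]
print(outputs)
```

Each output row is just a weighted average of the value vectors, so it always lands inside the region spanned by them. Nothing in this step knows or cares which side was called "query" and which "key".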
❤️ 2
ben

07/03/2023, 10:52 PM
oh shoot i mistyped: “information could theoretically propagate to either the decoder or encoder side arbitrarily”. i meant information could theoretically propagate to either the query or key side arbitrarily
@Ajay Arasanipalai @Leon Wu thank you!! ah, that was my basic question: “aren’t these query-like and key-like stories about the model made up?” glad to hear i’m not alone in not fully buying the human readable version of the story