# 06-technical-discussion

ben

07/03/2023, 6:39 PM
Hey, all! Reading the attention paper and I've gotten stuck; looking for advice:
why do we believe the query and key matrices are learning distinct sets of information?
here are my assumptions:
1. the dataflow looks exactly symmetric between the two; you could theoretically rename query and key with each other's names, so in backprop they have the same gradient flow. This looks pretty clear from the softmax(QK^T) part
1A. In the attention paper, the fact that in encoder-decoder attention the queries come from the decoder side and the keys come from the encoder side could make a difference here? But in decoder-only transformers I'm not sure this applies
2. if they have identical dataflow & backprop, then information could theoretically propagate to either the decoder or encoder side arbitrarily. So, information about "what x is looking for" and "what x offers" could just as easily get encoded in either matrix.
3. If (2) is true, then why might they not just both learn a mixture of "what x offers" and "what x is looking for" -- and thus not learn distinct sets of information at all? Do we have ablations to prove that the "query" corresponds to our common notion of "what x is looking for"?
Please, I'm very open to flaws in my reasoning; not a researcher lol, just a product eng trying to get a grasp of this. any help appreciated 🙏 🙏

👀 1

Ajay Arasanipalai

07/03/2023, 10:20 PM
> the dataflow looks exactly symmetric between the two

Not quite. The computations producing `Q` and `K` are identical, but the queries and keys are obtained using different sets of projection matrices/weights. So the computation might look like:

```
# with x of shape (d_model, T) and projections of shape (d_head, d_model):
Q = (q_proj @ x).T   # (T, d_head)
K = (k_proj @ x).T   # (T, d_head)
```

And `q_proj.shape == k_proj.shape` (for bidirectional self-attention), but `q_proj != k_proj`.
> you could theoretically rename query and key with each other's names

As long as they're the same shape, and you're using standard self-attention (so no multi-query, etc.), yeah, you could. Again, the computation is identical; what makes them different is the projection weights. So you compute `Q` and `K` like above, then "forget" about that and swap all the uses of `Q` and `K` throughout your code - the model would still work, because this is equivalent to just changing the variable names.

What you call `Q` and `K` doesn't matter at inference time.

👍 2
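A minimal NumPy sketch of that renaming argument (the sizes and random weights here are made up purely for illustration): relabel the two projection matrices and swap every use of them, and the attention output is unchanged.

```
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_head = 4, 8, 8  # hypothetical sizes, illustration only

x = rng.normal(size=(T, d_model))
q_proj = rng.normal(size=(d_model, d_head))
k_proj = rng.normal(size=(d_model, d_head))
v_proj = rng.normal(size=(d_model, d_head))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Wq, Wk):
    Q, K, V = x @ Wq, x @ Wk, x @ v_proj
    return softmax_rows(Q @ K.T / np.sqrt(d_head)) @ V

out_original = attention(q_proj, k_proj)

# "Rename": relabel the query weights as key weights and vice versa,
# then swap every use of the two labels -- a pure variable rename.
renamed_q, renamed_k = k_proj, q_proj
out_renamed = attention(renamed_k, renamed_q)

assert np.allclose(out_original, out_renamed)  # identical function
```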

Leon Wu

07/03/2023, 10:29 PM
also i don't think it's exactly symmetrical, because of the softmax
If you write out the matrix multiply, letting context length = T, with Q and K of dimension T x d_head, and letting q_i be row i of Q and k_j be row j of K,
you find that QK^T is a matrix that looks like


```
q_1 * k_1 , q_1 * k_2, q_1 * k_3 ... q_1 * k_T
q_2 * k_1 ...
...
q_T * k_1 ... q_T * k_T
```

and then the softmax is applied to each row of the resulting T x T matrix separately, i.e. softmax across the dot products with each key, holding the query constant (so q and k end up playing different roles)
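A quick NumPy check of this asymmetry (random matrices, purely illustrative): because the softmax normalizes each row, swapping the roles of Q and K is not just a transpose of the attention pattern.

```
import numpy as np

rng = np.random.default_rng(1)
T, d_head = 4, 8
Q = rng.normal(size=(T, d_head))
K = rng.normal(size=(T, d_head))

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Row i of A: q_i dotted with every k_j, normalized over j (over keys).
A = softmax_rows(Q @ K.T)

# Swapping Q and K transposes the score matrix, but the softmax still
# runs row-wise, so it now normalizes over the other axis of the scores.
B = softmax_rows(K @ Q.T)

assert np.allclose(A.sum(axis=1), 1.0)  # A is row-stochastic
assert not np.allclose(A, B.T)          # not the same thing in general
```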

Ajay Arasanipalai

07/03/2023, 10:50 PM
> why might they not just both learn a mixture of "what x offers" and "what x is looking for"

They could! The whole "what x offers" and "what x is looking for" thing is mostly just an analogy that's useful for building intuition. For example, at least in the simplest version of attention, there's no explicit loss term or architectural decision that forces `Q` to be more "query-like" and `K` more "key-like" - they're just words to ascribe some meaning to a soup of high-dimensional linear algebra. Even the word "attention" is somewhat anthropomorphic - the attention matrix simply represents a distribution over value vectors.

❤️ 2
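A tiny NumPy illustration of that last sentence (random scores standing in for `Q @ K.T`): each row of the attention matrix is a probability distribution over positions, and each output row is a weighted mixture of the value vectors.

```
import numpy as np

rng = np.random.default_rng(2)
T, d_head = 5, 8
scores = rng.normal(size=(T, T))  # stand-in for Q @ K.T / sqrt(d_head)
V = rng.normal(size=(T, d_head))

# Row-wise softmax: each row becomes a distribution over positions.
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = e / e.sum(axis=-1, keepdims=True)

# Each output row is a convex combination of the rows of V.
out = attn @ V

assert np.allclose(attn.sum(axis=1), 1.0)
```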

ben

07/03/2023, 10:52 PM
oh shoot i mistyped: "information could theoretically propagate to either the decoder or encoder side arbitrarily"
i meant information could theoretically propagate to either the query or key side arbitrarily
