Poorly Structured Notes on AI Part 3
So what is the significance of the “attention” pattern? The three matrices have names: Query ($Q$), Key ($K$) and Value ($V$).
Our inputs are vectors, $\vec{I}$, and we use the matrices as operators to do this:
\[A_{\mu\nu} = \hat{Q}\vec{I}_{\mu} \cdot \hat{K}\vec{I}_{\nu}\]

The dot product is at its positive maximum when the two vectors point in the same direction, at its negative minimum when they point in exactly opposite directions, and zero when they are orthogonal. So we're performing that comparison between every possible pairing of the (transformed) input vectors, producing a square matrix with one row and one column per input token.
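To make that concrete, here is a minimal NumPy sketch of the same computation (random matrices standing in for the learned $\hat{Q}$ and $\hat{K}$; all the variable names are mine, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d = 5, 8                    # input length, embedding width
I = rng.normal(size=(n_tokens, d))    # one input vector per token, as rows

# Stand-ins for the trained operators Q-hat and K-hat
Q_hat = rng.normal(size=(d, d))
K_hat = rng.normal(size=(d, d))

# A[mu, nu] = (Q_hat @ I[mu]) . (K_hat @ I[nu]), for every pairing at once:
A = (I @ Q_hat.T) @ (I @ K_hat.T).T

print(A.shape)  # (5, 5): square, one row and one column per input token
```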
Why perform a different transformation on the two sides of the dot product? It makes the comparison asymmetrical. If the same matrix (say, $\hat{Q}$) were applied to both sides, $A$ would end up a symmetric matrix, with nearly half its contents redundant. Breaking the symmetry means every entry of $A$ can carry independent information: the effect of token $a$ on token $b$ is different from the effect of $b$ on $a$. The matrices $Q$ and $K$ twist and flip each input vector differently.
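The symmetry claim is easy to verify numerically. In the sketch below (same assumed setup as above), applying one operator to both sides produces a matrix of the form $MM^T$, which is symmetric by construction; using two different operators breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))           # 5 tokens, 8-dimensional vectors
Q_hat = rng.normal(size=(8, 8))
K_hat = rng.normal(size=(8, 8))

# Same operator on both sides: A[a, b] == A[b, a], so nearly half
# the matrix is redundant.
M = I @ Q_hat.T
assert np.allclose(M @ M.T, (M @ M.T).T)

# Two different operators: token a's score against token b no longer
# has to equal token b's score against token a.
A = (I @ Q_hat.T) @ (I @ K_hat.T).T
assert not np.allclose(A, A.T)
```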
And then (after $\text{softmax}$ fixes it up) $A$ goes to work on the vectors $\hat{V} \vec{I}_{\nu}$, that is, the input vectors again flipped around by another matrix.
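Sketching that last step too (still with random stand-in weights; note that real transformers also scale $A$ by $1/\sqrt{d}$ before the softmax, which I've omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 8))
Q_hat, K_hat, V_hat = (rng.normal(size=(8, 8)) for _ in range(3))

A = (I @ Q_hat.T) @ (I @ K_hat.T).T

# softmax "fixes up" each row of A into positive weights that sum to 1
weights = np.exp(A - A.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# each output vector is a weighted mix of the V-transformed input vectors
output = weights @ (I @ V_hat.T)
print(output.shape)  # (5, 8): one vector out per token in
```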
Can we visualise this as a network with information flowing? It’s kind of a struggle. First, it’s a network that springs into existence temporarily, rather than a persistent structure in the model. Each token of the input is a node, and so the number of nodes depends on the length of the input. And what are the edges, and what travels along them?
The names of the matrices hint at their specific purposes: $Q$(uery) is asking a question, e.g.:
The query ($Q$) can be thought of as a representation of the current task at hand (e.g., “What word follows too?”).
— Foster, David. Generative Deep Learning
Whereas $K$(ey) is like an identifier in a mapping from keys to values:
The key vectors ($K$) are representations of each word in the sentence—you can think of these as descriptions of the kinds of prediction tasks that each word can help with. They are derived in a similar fashion to the query…
— ibid.
But not just in a similar fashion: an identical fashion. It's up to the model, in whatever happens during the training process, to distribute information between these matrices; since they work identically, it seems a little fanciful to dress them up with these specific roles (however vaguely defined).
Really I think all we can say is that by pre-transforming the input vectors with two different operators, the symmetry of the dot product is deliberately broken, so that the resulting $A$ matrix carries as much information as possible.
The value vectors ($V$) are also representations of the words in the sentence — you can think of these as the unweighted contributions of each word.
— ibid.
In a traditional fully-connected layer, we can think of the network as having a plastic structure: as it is trained, the strengths of the connections between nodes are gradually modified, some weakening to almost nothing, until eventually there is a specific, persistent network that holds knowledge. We can picture it topologically.
But here, the network (if it’s appropriate to think of it in that way at all) is an ephemeral thing that springs into existence, the number of nodes depending on the number of tokens in the input, their influence on each other being described by $A$, which is custom-generated to suit this length of input. It’s very different from an old-school artificial neural network, and very many steps removed from anything in the brain.
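The contrast is visible even in a toy sketch: the trained matrices are a fixed size, while $A$ is rebuilt from scratch, at whatever size fits, for every input (again, random stand-ins for learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# the persistent, trained structure: fixed-size matrices
Q_hat = rng.normal(size=(d, d))
K_hat = rng.normal(size=(d, d))

# the ephemeral "network": a fresh A, sized to fit, for every input
for n_tokens in (3, 11, 40):
    I = rng.normal(size=(n_tokens, d))
    A = (I @ Q_hat.T) @ (I @ K_hat.T).T
    print(n_tokens, A.shape)  # (3, 3), (11, 11), (40, 40)
```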
And we’re not even part of the way there yet!
Not yet regretting the time you've spent here?
Keep reading: