Poorly Structured Notes on AI Part 2

The input vectors \(\vec{I}_{n}\) to the transformer are therefore vectors consisting of $D_{model}$ numbers, found by adding the embedding vector \(\vec{E}_n\) for each token to the position vector \(\vec{P}_n\) for that token's position $n$ in the input stream:

\[\vec{I}_n = \vec{E}_n + \vec{P}_n\]

The \(\vec{P}_n\) depends only on $n$, and requires \(D_{model}\) to be an even number because it populates the dimensions in pairs, each pair taking on the coordinates of a point tracing out a circle at a different frequency.
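
For concreteness, here's a minimal NumPy sketch of one such scheme. The specific $1/10000^{2i/D_{model}}$ frequencies are the ones from the original transformer paper, which is an assumption on my part; the notes above only pin down the "dimensions in pairs, circles at different frequencies" structure.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """One position vector P_n per row: dimensions are filled in pairs,
    each pair holding the (sin, cos) coordinates of a point going round
    a circle at its own frequency."""
    assert d_model % 2 == 0, "dimensions are populated in pairs, so d_model must be even"
    positions = np.arange(n_positions)[:, None]                    # shape (n, 1)
    freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # one frequency per pair
    angles = positions * freqs                                     # shape (n, d_model // 2)
    P = np.empty((n_positions, d_model))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

# I_n = E_n + P_n  (E would come from a learned embedding lookup, not shown here)
# I = E + positional_encoding(E.shape[0], E.shape[1])
```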

Transformer implementations vary in their details. They may have an encoder, or a decoder, or both. The purpose of these components is not obvious until you’ve seen how they work independently or together in actual applications.

We’ll start with the encoder. The $\vec{I}_n$ inputs are fed into the first of several layers. The layers are structurally identical black boxes, but maintain their own parameters. As the model is trained, knowledge is distributed across all the layers.

A layer has three matrices, $Q_{ij}$, $K_{ij}$ and $V_{ij}$. To simplify, we’ll assume they are square $D_{model} \times D_{model}$. In real implementations this part is done in parallel by multiple “heads”, each working on a slice of the dimensions, so these matrices will be $D_{model} \times (D_{model}/h)$ where $h$ is the number of heads, and each head has its own triplet of matrices. We can note that this is possible and then forget about it as it’s an optimisation; $h$ can be set to $1$.

The three matrices are used as operators to transform each input vector into a new vector which we’ll name after the matrix but in lowercase:

\[\vec{q} = Q \vec{I}\] \[\vec{k} = K \vec{I}\] \[\vec{v} = V \vec{I}\]
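
To make the rest concrete, here's a hedged NumPy sketch with random stand-ins for the learned parameters and tiny, arbitrary sizes. The inputs are stacked as rows of a matrix `I`, so applying a matrix to every input vector becomes a single matrix product; the transposes are just there to match the column-vector convention of the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 8, 5          # tiny sizes, purely for illustration

# The layer's parameter matrices (square, since we set h = 1); random stand-ins here.
Q = rng.normal(size=(d_model, d_model))
K = rng.normal(size=(d_model, d_model))
V = rng.normal(size=(d_model, d_model))

I = rng.normal(size=(n_tokens, d_model))   # one input vector I_n per row

q = I @ Q.T   # row n is the vector Q @ I_n
k = I @ K.T   # row n is the vector K @ I_n
v = I @ V.T   # row n is the vector V @ I_n
```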

If you prefer abstract tensor notation you can think of the input as a matrix $I_{nd}$ with a row for each token and a column for each embedding vector dimension, so to get the $d$-th coordinate of the $n$-th projected $\vec{q}$ vector:

\[q_{nd} = Q_{dk} I_{nk}\]

The nice thing about that notation is that it literally tells you (with no need to remember how matrix multiplication works!) which ordinary numbers to multiply, summing over the values of $k$ in the range $D_{model}$, thus computing one element of a matrix $q_{nd}$ at row $n$ and column $d$, which we can also think of as a list of $n$ vectors $\vec{q}_n$.
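
NumPy's `einsum` accepts essentially this index notation verbatim, so the formula can be transcribed directly (continuing the sketch above):

```python
# 'dk,nk->nd': multiply Q[d, k] by I[n, k], sum over the repeated index k,
# and place the result at row n, column d of the output.
q_again = np.einsum('dk,nk->nd', Q, I)
assert np.allclose(q_again, q)
```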

Next the layer computes the attention weights $A_{\alpha\beta}$, where those indices $\alpha, \beta$ range over the number of tokens in the input. So this is a variable size square matrix relating every token to every other token in the input. Therefore it is immediately clear that it isn’t stored anywhere (the model has a fixed number of parameters, so it can’t depend on the length of a given stream of input.) It’s a “temporary variable” of this model iteration, used once and discarded.

The first step to computing these weightings is to get scores $\sigma_{\alpha\beta}$:

\[\sigma_{\alpha\beta} = q_{\alpha\gamma} k_{\beta\gamma}\]

Or if you prefer: generate a list of vectors, one per input token, each vector's dimension also being the length of the input token stream, by taking the dot product of each $\vec{q}$ with every $\vec{k}$.
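
Continuing the sketch, that's every pairwise dot product at once:

```python
# sigma[a, b] is the dot product of q_a with k_b: an (n_tokens, n_tokens) matrix.
sigma = q @ k.T
assert sigma.shape == (n_tokens, n_tokens)
```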

Now, because of what we’re going to do next, we want to keep the scores in a small range, so their variance doesn’t depend on the model's chosen $D_{model}$. Intuitively, suppose you have a set of numbers that are all either $1$ or $-1$ randomly (like coin flips). If you sum them, that’s like taking a random walk up and down the number line, starting at $0$. It’s a famous result that your typical (root-mean-square) distance from the origin after $N$ steps will be $\sqrt{N}$. What we’re saying here is that we want that distance to stay around $1$, regardless of the number of steps, and the dot product requires us to add $D_{model}$ numbers, so we need to divide the score values by $\sqrt{D_{model}}$.
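
A quick numerical check of that claim (still the same sketch): dot products of independent unit-variance random vectors have a standard deviation of roughly $\sqrt{D_{model}}$, and dividing by $\sqrt{D_{model}}$ brings it back to roughly $1$.

```python
d_big = 512
a = rng.normal(size=(10_000, d_big))       # 10,000 random vectors
b = rng.normal(size=(10_000, d_big))
dots = (a * b).sum(axis=1)                 # 10,000 dot products

print(dots.std())                          # roughly sqrt(512), about 22.6
print((dots / np.sqrt(d_big)).std())       # roughly 1.0

scaled = sigma / np.sqrt(d_model)          # the scaled scores for our sketch
```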

And now we are ready to use the $\text{softmax}$ function. This is widely used in machine learning and is sometimes described as producing a “probability distribution” but it does nothing of the sort. Given a vector, it returns another vector whose coordinates are all in the interval $(0, 1)$ and sum to $1$, so they certainly look as if they could be a probability distribution. Note that this is not the same as normalising the vector (making the squares of its components sum to $1$.)

It does this by exponentiating each coordinate and dividing it by the sum of all the exponentiated coordinates. Exponentiation has the effect of blowing up large values and shrinking small values, so it tends to single out and exaggerate a few spiky coordinates while suppressing others. This is one reason for normalising the variance of the raw scores: to lessen this effect. (Note that $\text{softmax}$ is applied separately to each scaled score vector, so the coordinates sum to $1$ within one vector, not over all coordinates of all vectors.)
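
A minimal softmax for the sketch, applied row by row. (Subtracting the row maximum before exponentiating is a standard numerical-stability trick, not something discussed above; it doesn't change the result.)

```python
def softmax(x):
    """Exponentiate each coordinate and divide by the sum of the exponentials,
    separately along the last axis, so each row lands in (0, 1) and sums to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

A = softmax(scaled)   # the attention weights: each row sums to 1
```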

And so by applying $\text{softmax}$ on each row of attention scores, we finally have our attention weights, and we can use them to transform the $\vec{v}$ vectors, which are the only ones we still haven’t made use of:

\[o_{nd} = A_{n\lambda} v_{\lambda d}\]

Once again, think of $\lambda$ as the summation variable iterating over the length of the input token stream.
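
In the running sketch that's one more matrix product: each output row is a weighted combination of the $\vec{v}$ vectors, with the weights taken from the matching row of $A$.

```python
o = A @ v   # o[n, d] = sum over lambda of A[n, lambda] * v[lambda, d]
assert o.shape == (n_tokens, d_model)

# Or, transcribing the index formula directly:
assert np.allclose(o, np.einsum('nl,ld->nd', A, v))
```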

This concludes the “attention” part of what the layer does, but we’re not done yet. We haven’t introduced nearly enough parameters to support NVIDIA’s share price. But it’s bedtime again.



