Poorly Structured Notes on AI Part 1

An AI model has a lot of parameters organised into matrices. These could be initially assigned random values. The training process will adjust them so they ultimately represent knowledge in some sense.

Considering the example of handling written words, the first step is to associate every word in the vocabulary with a unique integer. We just add every new word we encounter to a big ol’ list, so that its position in the list is its unique number. Note that in practice we might not work at the level of words; more generally we talk about tokens, which could be parts of words or groups of adjacent words, so we’ll switch to saying tokens from now on. But feel free to picture them as whole words from the dictionary.

We build our massive table of all tokens by tokenizing our entire training data set (say, Wikipedia and some copyrighted books we illegally downloaded). So now we have a list of $T$ known tokens, all numbered, and we can convert any text into a sequence of numbers.
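To make that concrete, here’s a minimal sketch of the “big ol’ list” in Python, where naive whitespace splitting stands in for a real tokenizer and the function names are just made up for illustration:

```python
# A toy vocabulary builder: every new token gets the next integer id.
def build_vocab(corpus):
    token_to_id = {}
    for text in corpus:
        for token in text.split():            # pretend whitespace splitting is tokenization
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)   # its position in the big list
    return token_to_id

def encode(text, token_to_id):
    return [token_to_id[token] for token in text.split()]

vocab = build_vocab(["the cat sat", "the dog sat"])
print(vocab)                          # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog sat", vocab))   # [0, 3, 2]
```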

Then we get the first hint of the curious importance of geometry in this discipline. We maintain a vector associated with each token we’ve ever encountered. Each vector is represented by an array of numbers. How long is the array? A fixed length that is a permanent characteristic of the model, called $D_{model}$ - $D$ is for dimensions, as this is the number of dimensions of one of these vectors, which we will find is an even number.

But as we have one such vector for every known token, this can be seen as our first matrix: think of it as having $D_{model}$ columns and $T$ rows. All the numbers in all the slots are assigned a random value initially. Admittedly, so far nothing particularly geometrical has happened, except that we called this a list of vectors.

So given some new incoming text that we’ve tokenized and turned into a sequence of integer token identifiers, we can fetch the vector representation associated with each token in the sequence, and now we have a sequence of vectors. These are called embedding vectors.
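As a sketch (with made-up sizes, and NumPy standing in for whatever the real model uses), fetching those vectors is just indexing rows of that $T \times D_{model}$ matrix:

```python
import numpy as np

T, D_model = 4, 8                              # tiny illustrative sizes
rng = np.random.default_rng(0)
embedding = rng.normal(size=(T, D_model))      # T rows, D_model columns, random to start

token_ids = [0, 3, 2]                          # e.g. "the dog sat" from the earlier sketch
vectors = embedding[token_ids]                 # one embedding vector per token, in order
print(vectors.shape)                           # (3, 8)
```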

Important: do not try to understand the meaning of the numbers in these vectors in isolation. They start off random, and they are going to be adjusted during training in a way that depends on everything else about this model. You can understand the abstract machinery of how this works by considering each piece on its own, but you can’t get even a hint of why it might possibly work without considering the whole thing end to end, and we’re not there yet.

The next thing we’re going to do is mix in positional information. The approach may seem reminiscent of Fourier analysis or AM radio. Think of a clock’s second hand ticking round and round. It has a certain frequency at which it makes one journey round the circle: one cycle per minute. The clock hand is a vector. We can describe it in coordinates using:

\[x = \cos(2\pi t / 60)\] \[y = \sin(2\pi t / 60)\]

The scaling applied to the time $t$ determines how fast it cycles. Now think of time ticking forward in integer increments, so $t$ is an integer.
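A quick numerical check of the clock-hand picture, just the formula above evaluated at a few integer ticks:

```python
import math

# The second hand as a vector: one full cycle every 60 ticks.
for t in (0, 15, 30, 45, 60):
    x = math.cos(2 * math.pi * t / 60)
    y = math.sin(2 * math.pi * t / 60)
    print(t, round(x, 2), round(y, 2))
# t=0  -> (1.0, 0.0)    pointing along the x axis
# t=15 -> (0.0, 1.0)    a quarter turn later
# t=60 -> (1.0, -0.0)   back where it started after one full cycle
```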

We could have a whole line of clocks of different frequencies, getting slower along the line. One formula for the $i$-th clock is:

\[x = \cos \frac{t}{b^{i/I}}\] \[y = \sin \frac{t}{b^{i/I}}\]

The term on the bottom of the fraction is some base number $b$ raised to the power $i/I$, where $i$ is the position of the clock in the line-up and $I$ is the number of clocks (say, 500). Supposing we number the first clock as $i=1$: any base $b > 1$ raised to the power $1/500$ is only a little larger than $1$, so the value of $t$ is passed almost unmodified to $\sin$ and $\cos$. By the time we get to $i=I$ the fraction $i/I$ is $1$, so we’re passing $t/b$ to $\sin$ and $\cos$. There’s a smooth curve in between, but clocks are distinct, discrete objects so we sample the curve at integer values of $i$.
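Here’s a small sketch of that line of clocks, assuming $b = 10000$ (the text leaves $b$ unspecified; this is just a commonly used value) and $I = 500$:

```python
import math

b, I = 10000, 500          # b is an assumed base; the text uses I = 500 clocks
t = 7                      # some instant, in integer ticks

for i in (1, 250, 500):
    speed = 1 / b ** (i / I)          # radians advanced per tick by clock i
    x, y = math.cos(t * speed), math.sin(t * speed)
    print(i, round(speed, 5), round(x, 3), round(y, 3))
# i=1   -> speed ~0.982, already just past one full revolution by t = 7
# i=250 -> speed 0.01,   has crept about 4 degrees round the dial
# i=500 -> speed 0.0001, has barely moved off its starting position
```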

Why do we need two functions? $\sin$ and $\cos$ are the same function phase-shifted by a quarter cycle. What is gained by both of them? The instantaneous value of $\sin$ alone passes through zero twice per cycle, once on the upswing and once on the downswing. But by also evaluating $\cos$ it is possible to distinguish the upswing and downswing of $\sin$ and vice versa, and know where you are in the cycle from a snapshot. There is something deeply significant about how you need the full picture of the circle, not just one of the coordinates oscillating.

  • Physical example: a pendulum (which is approximately a harmonic oscillator at small angles) may have position described by $\sin$, but to fully know its state at some instant you also need to know its momentum, which will be proportional to $\cos$, always one quarter cycle out of phase with the position. So its actual state traces out an ellipse in “phase space”.
  • Pure mathematics: $e^{i\theta}$ traces out the unit circle in the complex plane, being equal to $\cos \theta + i \sin \theta$.
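To make the upswing/downswing ambiguity concrete, here is a tiny numerical check: two different points in the cycle share the same $\sin$ value, but the $(\cos, \sin)$ pair pins down exactly where you are (via `atan2`):

```python
import math

theta_up, theta_down = 0.5, math.pi - 0.5        # two different points in the cycle
print(math.sin(theta_up), math.sin(theta_down))  # same value (~0.479) for both

# With both coordinates the angle is recovered unambiguously.
print(math.atan2(math.sin(theta_up), math.cos(theta_up)))     # ~0.5
print(math.atan2(math.sin(theta_down), math.cos(theta_down))) # ~2.642, i.e. pi - 0.5
```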

Anyway, what does this have to do with positional encoding? Replace the passing of time $t$ with the zero-based row number $p$ in a matrix, which we’ll call the PE matrix, and let the number of rows equal the number of tokens in the current input. There are $D_{model}$ columns, which we number with $i$. We fill in the matrix values by alternating between the $\sin$ and $\cos$ formulae. So each pair of adjacent columns describes a clock hand that is circling in discrete jumps as we scan down the rows. (You could also think of there being half as many columns, but each column holds complex values, and each row is generated by multiplying the previous row’s value by a fixed unit-magnitude complex number, i.e. a pure rotation, whose angle shrinks as we look along the columns.)

We can generate this matrix for any number of rows, and it’s always the same; it’s not something that will be adjusted by training.

The key to positional encoding is that for the input token at position $p$, we add row $p$ of the positional encoding matrix to the embedding vector (remember them?) of the token at position $p$. The resulting set of vectors is passed to the next stage.
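Here’s a sketch of that whole step, again assuming $b = 10000$; the exact exponent convention varies between write-ups, so treat the details as illustrative rather than definitive:

```python
import numpy as np

def positional_encoding(num_tokens, d_model, b=10000.0):
    # Row p, columns (2k, 2k+1) hold the (sin, cos) coordinates of clock k at
    # "time" p. The base b and the exponent convention are assumptions of this sketch.
    pe = np.zeros((num_tokens, d_model))
    p = np.arange(num_tokens)[:, None]        # zero-based row number, one per token
    k = np.arange(d_model // 2)               # one clock per pair of adjacent columns
    speed = 1.0 / b ** (2 * k / d_model)      # clocks get slower as k grows
    pe[:, 0::2] = np.sin(p * speed)           # even columns take the sin values
    pe[:, 1::2] = np.cos(p * speed)           # odd columns take the cos values
    return pe

# Stand-in embedding vectors for a 3-token input, then add row p to token p's vector.
embedded = np.random.default_rng(0).normal(size=(3, 8))
inputs = embedded + positional_encoding(3, 8)
print(inputs.shape)                           # (3, 8) -- ready for the next stage
```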

Where similar tokens appear in two places in the input, one of the pairs of dimensions will encode this similarity because its cycle period will happen to align with the distance between the two tokens. This has something of the flavour of AM radio, where there are many carrier frequencies modulated by actual information signals, or Fourier analysis. But there are some major differences:

  • here we’re summing the vectors rather than multiplying amplitudes,
  • we aren’t super-positioning (adding) all the frequencies, we’re keeping them separate in their own (pairs of) columns,
  • the information signal actually starts off random, and we’re going to train it to hold real information.

Now we finally have the input vectors to the transformer, but it’s time for a break.



