Vectors - Intuitions
We encounter vectors in a lot of topics, but I find I have to think of them in multiple ways. Tempting as it is to try to figure out which is the main or most fundamental way to think about them, I don’t think that kind of hierarchical thinking applies here. The three ways are:
- arrows sprouting from an origin and pointing in some direction in a geometrical space
- collections of numbers, $\mathbb{R}^N$ or $\mathbb{C}^N$, called coordinates
- abstract sets of any objects that obey certain rules
The abstract rules describe the other two ways, which suggests an underlying unifying system, but the only reason any significance was attached to those rules in the first place is that they capture the behaviour of arrows in physical space. On the other hand, any system that obeys those rules can be described by coordinates, which makes coordinates seem like the universal lingua franca of vectors; but on a third hand, this only works once you’ve chosen some basis vectors, so perhaps coordinates aren’t so fundamental after all.
The language used throughout this topic, even when dealing with abstractions, such as (in Quantum Mechanics) the vector space of complex-valued functions of a real variable, is nevertheless full of allusions to geometrical meaning. Thinking visually is indispensable (for me, anyway). An analogy can be a risky thing to depend on because it may have extraneous details that mislead you toward conclusions that have nothing to do with the real system you’re trying to study.
But fortunately, when it comes to vectors, the analogy of arrows-in-space is so simple, and exemplifies the key abstractions so closely, that it can be relied upon as a visual thinking aid, which is a life-saver if you’re a visual thinker.
But really, all three ways of thinking about vectors are inextricably interrelated, and they reinforce each other in building your understanding. So I think the best approach is to get comfortable with all of them so you can mentally switch between them to understand the same problem in multiple ways: visualising, calculating and doing algebra.
Clarity Regarding Arrows in Space
Picture a physical space, and then picture a single specific point in that space. Is it a vector? No it is not.
But if you picture a second point, and then imagine an arrow reaching between them, that arrow is a vector. So the essential distinction is that a vector doesn’t identify a point by itself; rather, a vector is, in a precise sense, the difference between two points, and that difference is not an ordinary number; it’s something else entirely. Note we’re not talking about the distance between two points, which would just be a number. We’re talking about something that incorporates distance and direction into one combined object.
It’s worth pausing to interrogate this a bit: we’re talking about subtracting one point from another, and the result is a vector. This is not quite how it works with numbers: subtracting one number from another gets you another number, whereas here the result of the subtraction is a different type of object. But in another sense it is just like regular subtraction, in that it is not commutative.
That is, suppose you had two points $\mathcal{P}$ and $\mathcal{Q}$, and you considered the difference between them, which we could write as $\mathcal{P} - \mathcal{Q}$. That’s one vector, an arrow that reaches from $\mathcal{Q}$ to $\mathcal{P}$, sometimes written as $\vec{QP}$. But looking at them the other way round, $\mathcal{Q} - \mathcal{P}$ is the arrow that reaches from $\mathcal{P}$ to $\mathcal{Q}$, or $\vec{PQ}$. These are equal-but-opposite arrows, covering the same distance but travelling in opposite directions. One of them is the negation of the other, i.e.
\[\vec{PQ} = -\vec{QP}\]Now, if you pick one special point and call it “the origin”, $\mathcal{O}$, then you can draw an arrow from $\mathcal{O}$ to any other point in the space, and thus you’ve established a mapping between points and vectors. So you can now say (if you like) that a point “is” a vector. But the thing is that you had to choose the origin, and you could have put it anywhere.
Suppose again that you start with two points $\mathcal{P}$ and $\mathcal{Q}$ on a piece of paper right in front of you, and you draw a dot for the origin $\mathcal{O}$ exactly halfway between them. The two vectors this defines are equal and opposite. But if we’d put the origin somewhere off in the Andromeda galaxy, the two vectors would need to reach a couple of million lightyears to end at the two points on your piece of paper, and would be pointing in almost exactly the same direction. So changing the origin completely changes the vectors associated with the same two points, and this shows how ridiculous it would be to conclude something about the points themselves based on the vectors that we chose to associate with them.
A related distinction between points and vectors is that you can add two vectors, but you can’t add two points. Yes, of course you could choose an origin, map two points to vectors against that origin, and then add those two vectors to get a third vector, identifying a third point relative to that origin, but again, consider the above scenarios: if the origin lies halfway between the two points, the sum of their vectors will be the zero vector (the two opposite vectors cancel each other out), but if the origin is in Andromeda the sum will correspond to a point far off in the direction away from that galaxy, about twice as far from it as your piece of paper. It’s not physically meaningful to add two points, because you’re just finding out about where you placed the origin, rather than finding out something useful about the two points themselves.
Incidentally, a space defined as a set of points, for which we have a subtraction operator that returns vectors, is called an affine space.
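To make the distinction concrete, here’s a minimal Python sketch (the `Point` and `Vector` classes are purely illustrative, not from any library): subtracting two points yields a vector, a vector can be added to a point, but adding two points simply isn’t defined.

```python
from dataclasses import dataclass

@dataclass
class Vector:
    dx: float
    dy: float

    def __add__(self, other: "Vector") -> "Vector":
        # vector + vector is meaningful: another vector
        return Vector(self.dx + other.dx, self.dy + other.dy)

    def __neg__(self) -> "Vector":
        return Vector(-self.dx, -self.dy)

@dataclass
class Point:
    x: float
    y: float

    def __sub__(self, other: "Point") -> Vector:
        # point - point is a vector: the arrow reaching from `other` to `self`
        return Vector(self.x - other.x, self.y - other.y)

    def __add__(self, v: Vector) -> "Point":
        # point + vector is a point; note there is deliberately no Point + Point
        return Point(self.x + v.dx, self.y + v.dy)

P, Q = Point(1.0, 2.0), Point(4.0, 6.0)
PQ = Q - P                  # the arrow from P to Q
assert P + PQ == Q          # adding that arrow to P lands on Q
assert Q - P == -(P - Q)    # PQ = -QP
```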
Adding Vectors
If we picture an arrow $\vec{PQ}$ reaching between two points as the vector result of subtracting one point from the other, $\mathcal{Q} - \mathcal{P}$, it becomes rather obvious that, starting with point $\mathcal{P}$, we can add that same vector to it and reach the point $\mathcal{Q}$.
And if we can do this once, we can add the vector again and reach a third point $\mathcal{R}$. What is $\mathcal{R} - \mathcal{P}$? It’s another vector, one that covers the whole distance in a single jump. Intuitively, it’s the vector you get by adding $\vec{PQ}$ to itself, or the vector you get by stretching $\vec{PQ}$ to be twice as long, $2\vec{PQ}$.
This is an example of adding colinear vectors. The test for whether two vectors $\vec{a}$ and $\vec{b}$ are colinear is whether $\vec{a} = k \vec{b}$ for some scalar $k$ (scalar being what we call ordinary numbers in this context, because that’s their job: to scale vectors).
But we can also add vectors that are not colinear (that are linearly independent). Geometrically we picture one of the vectors moving so its tail end is attached to the head of the other. In this way we make two jumps from the origin to a destination, and the vector that would take us directly from the origin to that destination is the sum of the two jumps.
If the two vectors you add are colinear, their sum will be colinear with both of them. If the two vectors you add are not colinear, their sum won’t be colinear with either of them. These conclusions are obvious if you picture the arrows, where colinear literally means “lying on the same line”.
The Time Line
A time line, where each point is some moment in history, is a great example of a one-dimensional affine space in which we can find vectors.
Of course we can put an origin anywhere (or rather, any when) in the line. The Common Era origin we use for numbering our years places it 2025 years ago (as I write this), though we will gloss over the fact that there is no year zero in that traditional numbering system.
But whether we have an agreed origin or not, we can measure the time between two events using a stopwatch. The numerical value we so obtain is the length of an arrow connecting two points on the timeline, and everyone can agree on that (well, it gets complicated if people are carrying their own stopwatches and have very different motions between the start and finish events, but that’s another story…)
If we do insist on choosing an origin so we can associate a numerical value with every moment in time, we need to ban ourselves from erroneously trying to add those values. If you add the moment of your birth to the moment you first jumped in a puddle, what does that mean? Nothing. This is why programming language libraries for dates and times provide different classes of object for representing an instant of time and a timespan. The class for an instant does not have an addition operator accepting a second instant, but it does have a subtraction operator. The timespan has an operator for adding another timespan, and you can also add a timespan to an instant.
Thus the instants are points (elements of an affine space) and the timespans are vectors (elements of a vector space).
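Python’s standard datetime library happens to follow exactly this pattern, so a tiny sketch (with made-up dates) illustrates the point/vector split directly:

```python
from datetime import datetime, timedelta

birth = datetime(1990, 5, 1)     # an instant: a point on the timeline
puddle = datetime(1993, 8, 17)   # another instant

span = puddle - birth            # instant - instant gives a timedelta (a timespan, i.e. a vector)
print(type(span))                # <class 'datetime.timedelta'>

later = birth + span             # instant + timespan gives another instant
print(later == puddle)           # True

print(span + timedelta(days=7))  # timespan + timespan is fine

# birth + puddle                 # TypeError: adding two instants is meaningless
```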
Dimensions and Bases
We’ve been distinguishing between the two-dimensional spaces we find on pieces of paper and one-dimensional spaces like a timeline. But we’ve also seen a hint of how we can formally count the dimensions of a space. Pick any two points on a piece of paper, one of which you call the origin, and there is a vector that reaches from the origin to the other point. By scaling that vector by some variable $x$ (which of course may be negative or zero) you can reach any point on the line that passes through the two points and stretches to infinity in both directions. This means you can use a single number $x$ to identify, equivalently:
- all the points on that infinite line, so $x = 0$ refers to the origin, and
- all the vectors that reach from your chosen origin to those points, so $x = 0$ refers to the zero vector.
That makes it a one-dimensional vector space. The vector you picked originally is the basis vector of your coordinate system, and in keeping with tradition we’ll call it $\vec{i}$.
If you pick another point that isn’t on that line (it can be literally any point) and draw the vector from the origin to that point, you have a second basis vector. This vector is linearly independent from the first: it reaches a point that cannot be reached by scaling the first basis vector by any amount. Following tradition again, we’ll call it $\vec{j}$. Once again, you can introduce a variable $y$ by which you can scale it to reach any point on a line. But more impressively, you can now separately scale the two vectors $\vec{i}$ and $\vec{j}$ and add the resulting vectors to reach any point on the piece of paper:
\[x \vec{i} + y \vec{j}\]This is pretty much exactly what we mean by a two-dimensional vector space: we use two numbers $(x, y)$ to identify points in it.
Likewise we can choose a point somewhere in the universe that is not in the plane that our piece of paper sits in, say, hovering a little way above the page, and picture an arrow from the origin to that point, giving us another vector $\vec{k}$ that is not colinear with $\vec{i}$ or $\vec{j}$, and so we need a third variable $z$ to allow us to build vectors that reach any point in three-dimensional space:
\[x \vec{i} + y \vec{j} + z \vec{k}\]As well as running out of letters for variables because we started too near the end of the alphabet, here the human imagination hits a basic limit that could well be part of the evolved, instinctive machinery of our minds. Logically we can see how it might be possible for there to be a fourth direction, implying the existence of vectors that are linearly independent from $\vec{i}$, $\vec{j}$ and $\vec{k}$ and requiring another variable. But we can’t picture it. It doesn’t matter; clearly the mathematical logic will work the same, and we can use the spaces that we can picture to get an intuitive feel for how higher-dimensional spaces work.
We can also solve the letter shortage by just writing all the coordinates as $(x_1, x_2, \dots x_n)$, and traditionally the basis vectors are likewise written as $\vec{e}_1, \vec{e}_2, \dots \vec{e}_n$.
An example of a background assumption that we might carry with us is the idea that the basis vectors are orthonormal (a combination of orthogonal, meaning all mutually at right-angles, and normalised, meaning all the same length). This is not necessarily the case. If we were developing the idea of vectors from abstract mathematical principles, we wouldn’t (yet) have introduced a way of saying what the angle between two vectors is, or even saying what the length of a vector is, for that matter. So we wouldn’t be able to say whether two vectors were orthonormal or not. But we can always judge whether they are colinear.
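A quick numerical aside (a NumPy sketch with made-up basis vectors): even a skewed, non-unit-length pair of basis vectors lets $x \vec{i} + y \vec{j}$ reach every point in the plane, and colinearity is something we can test without any notion of length or angle.

```python
import numpy as np

# Two made-up basis vectors for the plane: neither orthogonal nor of unit length
i = np.array([2.0, 0.0])
j = np.array([1.5, 0.5])

def reach(x, y):
    # Scale each basis vector by its coordinate and add: x*i + y*j
    return x * i + y * j

print(reach(0, 0))    # the origin (the zero vector)
print(reach(2, -1))   # some other point in the plane

def colinear(a, b):
    # a = k*b for some scalar k  <=>  the 2x2 determinant with columns a, b is zero
    return np.isclose(a[0] * b[1] - a[1] * b[0], 0.0)

print(colinear(i, 3 * i))   # True: scaling keeps you on the same line
print(colinear(i, j))       # False: linearly independent, so together they span the plane
```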
Coordinates
We’ve now arrived at the second view of vectors as collections of coordinate numbers, where there are as many coordinates as there are dimensions in the space. Whatever kind of exotic objects our vectors are, we can choose basis vectors which we can scale and add to find other vectors, and the numbers by which we scale automatically serve as coordinates that identify those vectors. The basic operations (scaling and adding) become utterly trivial when they are performed on the coordinates, because you just scale or add the coordinates individually. This makes vectors-as-numbers a fundamentally important representation. One way to think of it:
- vectors-as-arrows are the universal language of visualisation for vectors, while
- vectors-as-numbers provide the machinery for computation with vectors.
An important thing about vectors-as-numbers is that, in a big difference from vectors-as-arrows, there is a glaringly obvious set of basis vectors that are inherently orthonormal. These are often called the one-hot vectors, or the standard basis, in which one coordinate is $1$ and all the others are $0$, so there is necessarily one such vector per dimension.
Given this, we rarely think about any other basis for vectors-as-numbers. If you have a space of vectors-as-arrows, and you choose some basis vectors from that space, and use them to describe arrows with coordinates, what happens if you convert one of your basis vectors to coordinates? You get a one-hot vector. There is no particular reason to use anything other than one-hot basis vectors, except perhaps for playing around with these ideas.
Extracting a Coordinate From a Vector
We’ve seen how given coordinates $(x_1, x_2, \dots x_n)$ and basis vectors $\vec{e}_1, \vec{e}_2, \dots \vec{e}_n$ we can build the vector corresponding to those coordinates:
\[\vec{v} = x_1 \vec{e}_1 + x_2 \vec{e}_2 + \dots + x_n \vec{e}_n = \sum_n x_n \vec{e}_n\]But given the vector $\vec{v}$, how can we pull the coordinates back out of it, so to speak? It’s a bit like asking “how much $\vec{e}_1$ does $\vec{v}$ have in it?” The answer will be $x_1$. Suppose there was a function $f_1$ that you could pass a vector and it would return you the value of the first coordinate:
\[f_1(\vec{v}) = x_1\]You’d need one for each coordinate, that is, for every basis vector \(\vec{e}_n\) that you can scale with a coordinate $x_n$ and incorporate into a sum to make a vector $\vec{v}$, there is a corresponding function $f_n$ to which you can pass $\vec{v}$ and it gives back the coordinate $x_n$.
You may be thinking: wait a second, if vector $\vec{v}$ is already expressed in coordinates relative to some basis (so that, in those coordinates, the basis vectors themselves are the one-hot vectors), then surely to get the 3rd coordinate value all you have to do is… read off the 3rd position in the list of numbers that describes the vector?
And you are absolutely right. This is precisely why vectors-as-numbers in the standard basis are useful. They make this stuff trivial. But in that case, why make such a big deal of talking about a function for doing this? Because we want to know formally how to extract coordinates from any kind of vector. Otherwise how do we get vector-as-numbers in the first place? We’re inching toward the abstract view of vectors, where a vector space is a set of objects that conform to certain rules. This is why we have to think about initially mysterious functions that somehow pull coordinates out of vectors, without saying how they do it. We just assume they exist.
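For vectors-as-numbers in an arbitrary basis, one concrete way such coordinate-extracting functions can be realised is by solving a small linear system against the basis. A NumPy sketch, again with made-up basis vectors, which also shows the earlier claim that a basis vector expressed in its own coordinates is a one-hot vector:

```python
import numpy as np

# The columns of B are the basis vectors e_1, e_2 (written out in the standard basis)
B = np.column_stack([np.array([2.0, 0.0]),
                     np.array([1.5, 0.5])])

def coordinates(v):
    # Find (x_1, x_2) such that x_1*e_1 + x_2*e_2 = v, i.e. solve B @ x = v
    return np.linalg.solve(B, v)

v = 3.0 * B[:, 0] - 2.0 * B[:, 1]   # build a vector from known coordinates
print(coordinates(v))                # [ 3. -2.]  -- the coordinates come back out

# Converting a basis vector to coordinates in its own basis gives a one-hot vector
print(coordinates(B[:, 0]))          # [1. 0.]
print(coordinates(B[:, 1]))          # [0. 1.]
```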
The Abstract Vector Space
It’s time to consider the third way of thinking about vectors, as a set of objects of an unspecified nature, on which we define a couple of operations. First, addition: there is an operator $+$ that takes two objects from the set and returns another from the same set. This operator is commutative:
\[\vec{u} + \vec{v} = \vec{v} + \vec{u}\]and associative:
\[\vec{u} + (\vec{v} + \vec{w}) = (\vec{u} + \vec{v}) + \vec{w}\]There is a special object called $0$ (the zero vector), which makes no difference when added to any object from the set:
\[\vec{v} + 0 = \vec{v}\]Also every object has an opposite, known as its additive inverse, so they pair up. The inverse of $\vec{v}$ is written as $-\vec{v}$, and:
\[\vec{v} + (-\vec{v}) = 0\]The above can be written as $\vec{v} - \vec{v}$. Evidently $0$ is its own inverse. And we can check all these requirements against vectors-as-arrows (where $0$ is like an arrow of zero length and no direction) and vectors-as-numbers (where $0$ has zero for every component).
Second, we need to be able to scale vectors, but this introduces another kind of object. In classical physics we almost always use the real numbers $\mathbb{R}$ as the scalars (in QM we use the complex numbers $\mathbb{C}$).
Our vectors can be multiplied by a scalar to get another object. Scaling them by $1$ makes no difference. Scaling them by $-1$ discovers the additive inverse.
Given two scalars $a$ and $b$, we can compute $c = ab$ and then scale an object $\vec{v}$ by it, or we can separately scale the object first by $a$ and then by $b$, and the result is the same:
\[(ab)\vec{v} = a(b\vec{v})\]Scaling is distributive over addition of objects:
\[a(\vec{u} + \vec{v}) = a\vec{u} + a\vec{v}\]And also over addition of scalars:
\[(a + b)\vec{v} = a\vec{v} + b\vec{v}\]Again, arrows and columns of numbers have no problem meeting these requirements.
The Dual Space
Any vector you can obtain by scaling $\vec{e}_1$ alone must be colinear with it, and so such vectors don’t “contain” any $\vec{e}_2, \vec{e}_3 …$ mixed in. Whereas $\vec{e}_1$ “contains” exactly one copy of… itself! Important: these are statements about the coordinate system defined by our choice of basis, not about some geometric relationship between pairs of basis vectors. They could be at right-angles, or they could be almost-but-not-quite pointing in the same direction, and we’d still say that they serve as the basis for totally independent coordinates.
So going back to our definition of coordinate-extracting functions, a matching pair like $f_1({\vec{e}_1})$ yields $1$, but mismatches like $f_1({\vec{e}_2})$ must produce zero. Or to put it more succinctly, we use the symbol \(\delta_{nm}\), known as the Kronecker delta:
\[\delta_{nm} = \begin{cases} 1, & \text{if}\ n=m \\ 0, & \text{otherwise} \end{cases}\]And so:
\[f_n(\vec{e_m}) = \delta_{nm}\]What do we call these $f_n$? They go by many names, but I’m going to go with covector. This makes them sound like another kind of vector, which in fact they are. Why is that? How is that? Remember, we just built the abstract definition of a vector space, without saying what these vector objects are. They can be functions!
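One concrete realisation (a NumPy sketch continuing the made-up basis from before): if the basis vectors are the columns of a matrix, the rows of its inverse behave as the coordinate-extracting covectors, and pairing each row with each column reproduces the Kronecker delta.

```python
import numpy as np

# Basis vectors as the columns of B (the same made-up skewed basis as before)
B = np.array([[2.0, 1.5],
              [0.0, 0.5]])

F = np.linalg.inv(B)   # row n of F plays the role of the covector f_n

def f(n, v):
    # f_n applied to a vector: pair up the n-th row of F with v's components
    return F[n] @ v

# f_n(e_m) should reproduce the Kronecker delta
delta = np.array([[f(n, B[:, m]) for m in range(2)] for n in range(2)])
print(delta)   # the identity matrix: 1 on the diagonal, 0 elsewhere
```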
First, if we have vectors $\vec{v}$ and $\vec{w}$, described in our basis $\vec{e}_n$ by coordinates $v_n$ and $w_n$, that is:
\[\vec{v} = \sum_n v_n \vec{e}_n \,\, , \,\, \vec{w} = \sum_n w_n \vec{e}_n\]What happens when we apply $f_m$ to the sum of $\vec{v}$ and $\vec{w}$?
\[f_m(\vec{v} + \vec{w}) = f_m(\sum_n v_n \vec{e}_n + \sum_n w_n \vec{e}_n)\]Gathering like terms and pulling out common factors (we’re just adding a couple of summations, so it’s a long list of things being added):
\[f_m(\vec{v} + \vec{w}) = f_m \left(\sum_n [v_n + w_n] \vec{e}_n \right)\]But by definition $f_m$ extracts the $m$-th coordinate, which here is just the sum of the two original $m$-th coordinates. So it’s actually:
\[f_m(\vec{v} + \vec{w}) = v_m + w_m\]And we can obtain those two components from the individual vectors using $f_m$:
\[f_m(\vec{v} + \vec{w}) = f_m(\vec{v}) + f_m(\vec{w})\]In other words, evaluating $f_m$ once on the sum of two vectors gets the same result as evaluating it separately on the two vectors and adding the results. This is one of the requirements of linearity. By a very similar argument we can show the other requirement. Scaling by $k$:
\[f_m(k \vec{v}) = k v_m = k f_m(\vec{v})\]So we can scale the vector by $k$ and give it to $f_m$, or we can give the original vector to $f_m$ and “scale” the resulting number by $k$, and obtain the same result.
Therefore, merely by requiring that $f_m$ does the job of pulling out a coordinate of a vector, we have necessarily required that $f_m$ is linear. This is more than just a curiosity. First, we can define addition on covectors:
\[h(\vec{v}) = f(\vec{v}) + g(\vec{v})\]This is pointwise addition, where the two covectors $f$ and $g$ are independently made to act on the input vector and we just add the resulting scalars to get the result. It’s like $h$ is a black box that accepts a vector through a slot, and out pops a scalar - just like any covector - but inside the box it has $f$ and $g$ wired up to each accept a copy of $v$, and then their answers are summed to produce the overall output. Likewise we can define pointwise scaling:
\[g(\vec{v}) = k f(\vec{v})\]This time $g$ is a black box that from the outside works like any other covector, but internally it simply passes the input vector to covector $f$ and returns the result multiplied by a constant $k$.
But due to the linearity of covectors, all the rules of an abstract vector space turn out to be satisfied by covectors, which means that they are in fact a kind of vector. The zero covector is the function that always produces zero, ignoring the input vector entirely. In relation to the original vector space, we say there is a dual space containing the covectors that can act on those vectors.
Further, even though we began by imagining only one covector per dimension of the vector space, these are now revealed to be basis vectors of the dual space. From these we can create weighted sums that build other covectors, just as we can build any vector from a weighted sum of basis vectors. Any covector $h$ can be expressed by scaling and adding the $f_1, f_2, \dots f_n$.
\[h(\vec{v}) = h_1 f_1(\vec{v}) + h_2 f_2(\vec{v}) + \dots + h_n f_n(\vec{v}) = \sum_n h_n f_n(\vec{v})\]That is, covectors can be described by coordinates. But this gives us a way to tie all this abstraction back to vectors-as-numbers, and here’s where everything collapses into something almost suspiciously simplistic.
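Here’s a small sketch of those pointwise operations in Python, treating covectors literally as functions from vectors to scalars (the basis and coefficients are made up):

```python
import numpy as np

B = np.array([[2.0, 1.5],    # basis vectors as columns, as before
              [0.0, 0.5]])
F = np.linalg.inv(B)         # rows act as the basis covectors f_1, f_2

f = [lambda v, row=row: row @ v for row in F]   # each f_n is a function: vector -> scalar

def add(g1, g2):
    # pointwise addition: feed the vector to both covectors, add the scalar outputs
    return lambda v: g1(v) + g2(v)

def scale(k, g1):
    # pointwise scaling: scale the scalar output by k
    return lambda v: k * g1(v)

# Build a covector h = 4*f_1 + (-1)*f_2 from the "coordinates" (4, -1)
h = add(scale(4.0, f[0]), scale(-1.0, f[1]))

v = np.array([3.0, 7.0])
print(h(v))                              # acting on a vector gives a scalar
print(4.0 * f[0](v) - 1.0 * f[1](v))     # the same number, built up term by term
```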
Any covector $g$ acts on any vector $\vec{v}$ to give us a scalar $s$:
\[s = g(\vec{v})\]But $g$ is a weighted sum of basis covectors $f_n$, using coordinates $g_n$, and we know how summing and scaling of covectors works - it’s pointwise:
\[s = \sum_i g_i f_i(\vec{v})\]Also, the vector $\vec{v}$ is a weighted sum of basis vectors $\vec{e}_n$ using coordinates $v_n$:
\[s = \sum_i g_i f_i( \sum_j v_j \vec{e}_j )\]But covectors are linear, so the covector acting on summed vectors is the same as the sum of the covector acting individually on the vectors:
\[s = \sum_i \sum_j g_i f_i( v_j \vec{e}_j )\]And again by linearity, a covector acting on a scaled vector is the same as scaling the result of the covector acting on the vector:
\[s = \sum_i \sum_j g_i v_j f_i(\vec{e}_j)\]But $f_i(\vec{e}_j)$ is just every combination of a basis covector acting on a basis vector. We defined this in the first place as \(\delta_{ij}\) (zero if $i \ne j$, one if $i = j$) so all the terms where $i \ne j$ must vanish, leaving:
\[s = \sum_i g_i v_i\]We no longer need the two “nested loops” for the sum: the effect of \(\delta_{ij}\) is to eliminate one of the indices, as everything is zero except the main diagonal.
So to compute the action of a covector on a vector, get their coordinates, and then pair them up, multiply them, and sum those products. This is a weirdly simple result.
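For instance, with made-up numbers: a covector with coordinates $(2, 5)$ acting on a vector with coordinates $(3, -1)$, each set of coordinates relative to mutually dual bases, gives
\[s = (2)(3) + (5)(-1) = 6 - 5 = 1\]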
If you’re familiar with the so-called dot product, here it apparently is. But note that the two sets of coordinates being paired up, multiplied and summed are necessarily those of a covector and a vector. In general this formula doesn’t work for two vectors (or two covectors). To use this dot-product-like calculation, the two objects must be of opposite types.
Also look how symmetrical it is: in this calculation, you can’t tell which one is the vector and which the covector. We could just as well say that the vector is a function that “acts on” the covector. The two spaces are mutually dual. And given this, and also our discovery that covectors are in fact vectors, it’s time we adopted a more “grown up” notation.
We actually have a choice (standards, eh?). First, the index position: when we’re talking about Relativity, similar to how we use \(\vec{e}_n\) as the notation for a basis vector, we use \(\vec{e}^n\) for a basis covector. That way, we’re writing them as vectors (which they are), but we’re using a superscript instead of a subscript for the index. This has all kinds of nice by-products, and turns out to be (mostly) worth the potential for confusion around raising things to powers. Just remember that in this area, unless otherwise specified, \(\vec{e}^2\) doesn’t mean we’re squaring anything. And we use the opposite index position for coordinates. That is, a vector’s coordinates should be written \(x^n\) (again, we’re not raising anything to a power here!) whereas a covector’s will be \(x_n\).
And second, when a covector acts on a vector, or when a vector acts on a covector (as we’ve seen, it makes no difference how you think of it), we can use:
\[\langle \vec{a},\vec{b} \rangle\]I haven’t said which of $\vec{a}$ or $\vec{b}$ is the covector, because it doesn’t make any difference to the result. But it is absolutely a rule that one of them must be a covector and the other a vector, i.e. they have to be of opposite types.
Also note that this is far from being a universal notation. Some people use $\vec{a} \cdot \vec{b}$ like it’s the dot product, which seems reasonable given that the arithmetic turns out to be identical, but I think that’s liable to be confusing because we first learn about the dot product as something that operates on two vectors, and we still haven’t explained why that’s sometimes allowed. I like the angle brackets because it resembles Dirac’s notation for quantum mechanics, which is somewhat related.
Pairing Up Any Vector With a Covector
Think about how we started with only a handful of covectors, which pair up with the basis vectors, but then we discovered that those covectors are just the basis covectors of a dual vector space, so we’ve conjured up an infinite space of covectors. There are as many covectors as there are vectors. So do all the vectors and covectors naturally fall into pairs?
The answer is no. We’ve only paired the basis vectors with the basis covectors, which leaves all the others dangling independently. We can say how we’d like them to pair up, but we haven’t done that yet. We’d need to specify a function that takes a vector and produces a covector, and likewise a function that maps the other way. Note that the first function takes a vector and produces a covector, which is another function that takes a vector! If you’re familiar with functional programming, this is hugely reminiscent of currying.
It’s as if we’ve built, by accident, a function that takes two vectors and returns a scalar. Note: not a vector and a covector, but two ordinary vectors. Internally it works in two stages: first there’s this new “converter” that turns the first input vector into a covector, then that acts on the second input vector, to get the final scalar output.
Or to put it another way, if you have:
- two vectors
- a way to get a scalar from a vector and a covector
- a way to get a covector from a vector
Then just convert one of your vectors into a covector, apply that to the other vector, and you’ve got a scalar.
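In code, that converter really is a function that returns a function, which is why it smells like currying. A minimal Python sketch, with a made-up symmetric matrix of weightings playing the role of the converter (the next paragraphs give it its proper name):

```python
import numpy as np

g = np.array([[1.0, 0.2],   # a made-up symmetric matrix of weightings
              [0.2, 2.0]])

def lower(v):
    # Take a vector, return a covector: a function still waiting for a second vector
    def covector(w):
        return v @ g @ w
    return covector

v = np.array([1.0, 3.0])
w = np.array([2.0, -1.0])

v_flat = lower(v)      # a covector: something that eats a vector and returns a scalar
print(v_flat(w))       # the scalar obtained from the two vectors
print(lower(w)(v))     # the same number: g is symmetric, so the order doesn't matter
```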
Some standard terminology:
- A machine for converting a vector into a covector is called a metric, or sometimes the lowering operator.
- A machine for converting a covector into a vector is sometimes called an inverse metric (really the two machines are mutually inverse, of course), and more often a raising operator.
- A machine for getting a scalar from two vectors (by first converting one of the vectors into a covector using the lowering operator) is called an inner product.
So what’s our machinery for converting vectors into covectors? We can describe vectors and covectors with coordinates. The mapping between them will be linear, so each coordinate of the output covector will be a weighted sum of the coordinates of the input vector. We can write these weightings down as a matrix, \(g_{ij}\). So to convert a vector $\vec{a}$ to a covector $\vec{b}$:
\[b_i = \sum_j g_{ij} a^j\]At least, that’s the vectors-as-numbers way of putting it. To be a little more abstract about it:
\[\vec{b} = \sum_i \vec{e}^i \sum_j g_{ij} \langle \vec{e}^j , \vec{a} \rangle\]We want the mapping to be one-to-one, so there must be an inverse matrix that turns covectors into vectors, and so we call that $g^{ij}$ - note the superscript indices.
There’s a pattern emerging here that begins to explain this rather wacky use of subscript and superscript indices: you only ever multiply and sum over coordinates whose index positions are mismatched, one up and one down. An object with indices is a little calculating machine that accepts one input for each index, and if the index is “down”, that input must be a vector (so that the vector’s coordinates have the up-index), whereas if the index is up, the input must be a covector (with down-index). It’s a kind of compatibility-checking system built into the notation. If an object has multiple indices, that tells you how many inputs (and of which types) it needs to produce a scalar.
But now we are ready to answer the question: when are we allowed to use the dot product machinery directly on two vectors? Well, looking at it purely numerically there’s the obvious situation where $g$ happens to be the identity matrix. In that case, our mapping of vectors to covectors is trivial: each covector has the same coordinates as its dual vector. Therefore there’s no danger from us “forgetting” to convert a vector into a covector before we use \(\langle \vec{v}, \vec{w} \rangle\) to get a scalar.
We noted how being able to convert a vector to a covector automatically gives us a way to get a scalar from two vectors, and we can get straight to the answer like this:
\[s = \sum_i \sum_j g_{ij} v^i w^j\]Although we arrived at this by thinking of $g$ acting on $\vec{v}$ to get a covector, which then acts on $\vec{w}$, numerically we can see it’s just a lot of multiplying and summing: every combination of coordinates is multiplied, and then weighted by a specific value from $g$, and then all that gets summed.
We want this scalar result to be the same if we swap the vectors around, i.e. symmetric, as we’re looking for ordinary numbers that somehow describe the relationship between the vectors, regardless of any arbitrary choices we might make. It follows that $g$ must be a symmetric matrix, i.e. $g = g^{\intercal}$.
We are now in a position to formally define orthogonality between vectors and the significance of it:
\[\sum_i \sum_j g_{ij} v^i w^j = 0\]That is, for two vectors $\vec{v}$, $\vec{w}$, if their inner product is $0$, they are orthogonal. It’s a bit like the opposite of being colinear: they make no contribution to each other. Note that although we’ve written this down as if we are describing them in coordinates, this is not a fact about the coordinates, but a fact about the two vectors and their geometric relationship. Thought of as arrows, they are at right-angles to one another. This has nothing to do with how we’ve mapped them to coordinates. So to clarify that, we’ll introduce an abstract, coordinate-free notation for the inner product:
\[\left( \vec{v} , \vec{w} \right)\]With parentheses, we mean that it operates on two vectors, distinct from the action of a covector on a vector. And we can link this back to the matrix $g$ as follows:
\[g_{ij} = \left( \vec{e}_i , \vec{e}_j \right)\]That is, by applying the inner product to every combination of our basis vectors, we recover the entries of $g$.
Obviously a basis vector can’t be orthogonal (at right-angles) to itself, but we could require all basis vectors to be orthogonal to each other. So where $i \ne j$:
\[\left( \vec{e}_i , \vec{e}_j \right) = 0\]For this to be true, $g$, expressed in this basis, must be a diagonal matrix, i.e. zero everywhere except the main diagonal. We can be even more restrictive by requiring the basis vectors to be orthonormal, which means that in addition to the orthogonality requirement, we expect the inner product of each basis vector with itself to be 1:
\[\left( \vec{e}_i , \vec{e}_i \right) = 1\]And this finally forces $g_{ij}$ to be $\delta_{ij}$, the identity matrix. So there’s our answer: if our chosen basis vectors are orthonormal, then $g$ is the identity matrix, each covector has the same coordinates as its dual vector, the conversion from vector to covector requires no change to the coordinate representation, and so we can “get away” with using the dot product arithmetic to get the inner product of two vectors.
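To see how the choice of basis shows up in $g$, here’s a sketch in which the underlying geometry is ordinary Euclidean space and the basis arrows are written out in an orthonormal reference frame; $g$ is then just the table of inner products of the basis vectors with each other.

```python
import numpy as np

def metric_for(basis):
    # g_ij = (e_i, e_j): the inner product of every pair of basis vectors.
    # The underlying inner product here is the ordinary Euclidean dot product,
    # because these basis arrows are written in an orthonormal reference frame.
    return np.array([[e_i @ e_j for e_j in basis] for e_i in basis])

skewed = [np.array([2.0, 0.0]), np.array([1.5, 0.5])]
orthonormal = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

print(metric_for(skewed))        # symmetric, but not the identity
print(metric_for(orthonormal))   # the identity: dot-product arithmetic is safe here
```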
So why do we ever care about some other kind of exotic metric or inner product? Because:
- In Special Relativity, we have to use a metric for flat 4-dimensional spacetime which is almost the identity matrix, except that one of the elements on the diagonal (representing time) has the opposite sign to the other three (representing space), which means vectors can have negative squared lengths, and vectors other than the zero vector can have zero length (there’s a small numerical sketch after this list).
- In General Relativity, we have to use a metric for curved 4-dimensional spacetime which, although still symmetric, is otherwise allowed to be a mess; what’s more, we will generally need a different such metric for every point in spacetime, so it’s a metric field, and each point has its own set of basis vectors, which will diverge from orthonormality as we stray from the origin of our coordinate system.
- In Quantum Mechanics, we strictly use orthonormal basis vectors, but our scalars are complex numbers, so the coordinates of vectors are complex numbers. This creates another problem: we require the length of a vector to be a positive real number. If we used the dot product directly between vectors, we’d sometimes get negative or imaginary scalar results. To deal with this we have to fix the inner product appropriately.
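As promised, a quick sketch of the Special Relativity case. (The sign convention below, $\mathrm{diag}(-1, 1, 1, 1)$ with $c = 1$, is just one of the two common choices.)

```python
import numpy as np

# The flat-spacetime (Minkowski) metric with signature (-, +, +, +) and c = 1;
# the opposite convention (+, -, -, -) is equally common.
eta = np.diag([-1.0, 1.0, 1.0, 1.0])

def inner(v, w):
    return v @ eta @ w

timelike  = np.array([2.0, 1.0, 0.0, 0.0])   # more time than space
lightlike = np.array([1.0, 1.0, 0.0, 0.0])   # equal parts time and space

print(inner(timelike, timelike))     # -3.0: a negative squared "length"
print(inner(lightlike, lightlike))   #  0.0: zero length, yet not the zero vector
```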