Two Cheers for SQL
What is there to say about this old stain on the technology landscape? Settle in…
SQL is not "cool". It probably never has been. On the one hand there are the technologies we hate, and on the other the technologies no one uses.
Having spent a few years going back and forth on the merits of SQL, I'm in a weird place. I now think it is both underrated and overrated, and not merely because other people are too extreme in their opinions. I genuinely think SQL is both a fine idea and a terrible idea at the same time. There is a way of using it that makes sense, and many other ways that don't.
Is select a satire on orthogonality?
The heart of SQL is select. What's wrong with it?
All modern ("cool") language features for querying data take a consistent pipeline approach, letting you perform successive transformations on data sets, chaining them together in a way that feels natural, like constructive exploration. But not SQL.
The columns in the output of your query are what you have to write down first. This is the opposite of the natural way to think creatively about data. It also breaks apart the two dimensions of the output:
- What columns should be included?
- What rows should be formed?
The columns have to be stated at the start, the rows are (usually) formed at the end. It also confounds any possibility of your editor making helpful auto-completion suggestions. I generally find myself writing select * from first so I can start from the actual beginning, the source tables, and then come back and replace * later when I've figured out how to get the data I need with the help of the editor's UI.
The above flaw is corrected by all modern frameworks that provide a consistent way of piping data through composable operators, because they naturally start with the input and let you evolve it, one stage at a time, switching between operators as you require.
And really this is just one symptom of the broader problem with select, which is that there's nothing modular or systematic about the various features that are crammed into it. They don't compose in flexible ways. They compose in exactly one pre-determined way: select columns, from a table, then optionally add joins to other tables (or apply, and how about pivot or unpivot?), then optionally a where, then optionally a group by, then optionally a having, then optionally an order by.
It's less like artfully constructing from separate building blocks, and more like filling out a bureaucratic paper form, leaving the irrelevant items blank. Ideas have been retconned into it over the decades, and special homes had to be found for them, because the underlying model is not extensible. In any given revision, it has always carried the pretension that it is finished now, rather than the humility to know that it is never finished and must be open to extension.
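To make the form-filling concrete, here is a sketch of the one shape select allows, every clause in its fixed slot, used or left blank. It reuses the trans table and @start_date variable from the example later in this piece; the vendor table is assumed purely for illustration.

-- the one permitted shape: each clause has its slot, in this order
select trans.vendor_id,
       sum(trans.amount) as total
from trans
join vendor                                   -- vendor table assumed for illustration
    on vendor.vendor_id = trans.vendor_id
where trans.entered_date >= @start_date
group by trans.vendor_id
having sum(trans.amount) > 0
order by total desc;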
How to do composition
But SQL does have a general way of composing operations into a pipeline: common table expressions. They've been in the ANSI standard for 20 years. They're often thought of just as a way to achieve recursion in queries, but they can also be used simply as a way of structuring complex queries so they remain modular and readable. If you restrict yourself to only using one feature of select in each stage, you can get close to the ideal.
Here I want to discover patterns in monetary transactions (going into or out of a business) with different vendors. Each stage can see all the previous stages as if they were views or tables.
-- only interested in recent transactions
with recent_trans as (
select vendor_id,
left(amount, 1) as first_digit,
amount
-- other columns...
from trans
where entered_date >= @start_date
),
-- often only interested in money paid out
invoices as (
select vendor_id,
first_digit,
amount
-- other columns...
from recent_trans
where amount >= 0
),
-- how often do specific digits occur first
invoice_first_digit_frequency_by_vendor as (
select vendor_id,
first_digit,
count(*) as frequency
from invoices
group by vendor_id,
first_digit
),
-- how often do certain amounts occur
invoice_amount_frequency_by_vendor as (
select vendor_id,
amount,
count(*) as frequency
from invoices
group by vendor_id, amount
),
-- and so on...
At the end I can write an update that saves all the facts and figures about each vendor into its own row. I give the whole task to SQL and let it figure out the fast way to execute it. This is called a set-based operation.
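In case the shape of that closing statement isn't obvious, here is a sketch of how it might look, continuing the with-chain above. The vendor_stats table and its column are invented for illustration, and update ... from is SQL Server syntax that other vendors spell differently.

-- continues directly from the final "and so on..." stage above
update vs
set distinct_amount_count = freq.distinct_amounts   -- invented column
from vendor_stats as vs                              -- invented summary table
join (
    select vendor_id, count(*) as distinct_amounts
    from invoice_amount_frequency_by_vendor
    group by vendor_id
) as freq
    on freq.vendor_id = vs.vendor_id;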
With this ability to work in stages that build on previous ones, you can very easily come up with queries of great power and complexity, while still understanding what you've done and leaving the result easy for others to understand, maintain and reuse. None of this should surprise anyone with experience of compositional features in any other language. They're great.
But the situation is somewhat misleading in the case of SQL.
The significance of performance
There are many constraints on software: the best advice you'll get is to give the highest priority to making your software easy to modify without breaking it. Modularity helps, information hiding helps (not letting modules know too much about each other), and any sort of high-level declarative approach will also help.
At some point you will have to worry about how efficiently the software does its job; but this is just one of the many constraints, and should not be allowed to wreck the primary concern: ease of modification. It is only through ease of modification that you will be able to make the changes that will make it all go faster.
"First make it correct, then make it fast." Sure, but this adage misses a major problem. Some implementations are so slow that they fail to return the correct answer before it becomes worthless. To be that slow is to be incorrect. The most embarrassing bug is a hang; a sufficiently slow implementation is indistinguishable from a hang.
This doesn't mean that performance trumps ease of modification as a concern. But it does mean that we have to understand how our high level description is turned into an executing implementation in order to meet the basic requirement of correctness.
In this regard, SQL is not fundamentally different from any other high-level language. But in practice, due to the degree of freedom it exercises, the results can vary so wildly that SQL ends up being very different from regular languages.
Any modern high-level language will apply some optimisations to the code you write: it looks for patterns that it can rearrange to make things run faster. These can be "non-local" patterns that involve relationships between lines of code that are far apart in your source and so are not immediately obvious unless you know how the compiler works. But this effect is usually quite limited and predictable; it's more like a sprinkling of magic that gives you a general speed-up.
In a wide range of software scenarios it is an excellent idea to let the accumulated expertise of smart people work on your behalf by getting software to automatically translate your high-level constraints into an optimal low-level form that can be directly executed. That is precisely what happens with a regex: it gets converted into a state machine. Same for any high-level programming language that compiles down to either machine code or something JIT-able.
SQL's approach to optimisation
When you send a SQL query to an RDBMS, it also performs a step akin to compilation: the query is analysed to produce an optimised plan that can be executed. But unlike imperative code, there is no structure, provided by you, for it to use as a starting point. The optimisation process starts from scratch, and the approach it takes to constructing the plan is very much dependent on things that aren't in your query:
- the design of the optimiser (which may be closed source)
- the indexes that exist (which you choose separately, and may even be accumulated automatically by the RDBMS)
- the statistics the RDBMS maintains about the current contents of its indexes (statistics which are only refreshed periodically, and so may lag behind the actual data)
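None of those levers appear anywhere in the query text itself. As a rough sketch (SQL Server syntax; the index name and included columns are my own guesses at what the earlier example would want), the things you actually adjust look like this:

-- an index the optimiser may or may not decide to use
create index ix_trans_entered_date
    on trans (entered_date)
    include (vendor_id, amount);

-- refresh the sampled statistics the optimiser bases its guesses on
update statistics trans;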
Taking my example above that used chained with... as stages, it is not the case that each stage is evaluated separately and its results stored temporarily to supply to the next stage. If in the full example the only stage that used recent_trans was invoices (supposing I only separated them for clarity), then they could be flattened into one stage. More stages in my SQL source code do not translate into more stages in the execution plan. To reiterate: the query optimiser rips up your code and starts again. This is the purported advantage of high-level declarations: they define the constraints on what the results should contain, without specifying how those results should be obtained.
The theory is that by applying its own understanding of how to run fast, the RDBMS is best placed to translate your constraints into the imperative computational steps that are actually executed.
Building a ship in a bottle
Have you ever tried to make a ship in a bottle? I haven't but I understand that you typically build the ship outside the bottle but with the masts folded down so it just barely fits through the bottleneck, and then you pull on little cords to bring the masts upright, leaving people astonished and speculating as to whether you had to hold two pairs of long tweezers through the bottleneck and struggle for years to manipulate each piece into place.
Imagine if you did have to build the ship inside the bottle like that. How would that even work? It would be absurdly frustrating.
Well, that's how you optimise a SQL query execution plan. If the plan is doing something stupid, you can't just edit the plan. The plan is not yours to edit. You have to change the conditions that caused the query planner to choose a bad plan, such that next time it will choose a better one. You are "outside the bottle", as it were, in the high-level concepts, and the query planner is hidden inside the bottle, messing with the ship. You are very much at its mercy.
This metaphor doesn't quite capture all the maddening aspects of the situation, because at least with tweezers and a clear glass bottle you can see what you're doing wrong. You have that instant feedback cycle. But in SQL you don't have a complete specification of how the query planner decides what to do based on the available indexes. It's as if the bottle is opaque, and you're holding tweezers equipped with AI. They are trying to guess what you intend as you move them around by the handle, and they keep getting it wrong.
By "wrong" of course I just mean "slow", but as we noted above, such a degree of unpredictability is really no better than simply failing to work. You can think you've written a query that performs acceptably, ship it, and only then discover that under some conditions it performs so badly that it is effectively broken. This is where the sales pitch of SQL leaves us disappointed.
Of course, the database vendors have the solution: they provide a way for you to see the execution plan - to look inside the bottle, so you can tweak the constraints of your environment until the execution plan doesn't do something silly.
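The spelling of that peephole varies by vendor; as a rough sketch (the literal date is just a placeholder), it looks something like this in a couple of popular systems:

-- PostgreSQL / MySQL: prefix the query to see the chosen plan
explain
select vendor_id, count(*) as frequency
from trans
where entered_date >= '2024-01-01'
group by vendor_id;

-- SQL Server: ask for the plan to be returned alongside the results
set statistics profile on;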
But now we've arrived at an absurd place: we're in the business of modifying a high-level description so that an algorithm we are unaware of will produce the low-level description we require. This means we have to understand both, while never entirely understanding how one produces the other. Small changes made to the high-level description may have huge effects on the low-level.
This activity makes no sense. Unless we're doing this because we enjoy being driven to distraction, we have to take direct control of the low-level description. This means we have to re-write our high-level, nicely composed set-based query in another form.
There are ways to do this in SQL. One frequently used approach with a complex query is to make each step dump its output into a temporary table (a non-standard feature that major RDBMS vendors have all implemented in different ways). By doing this, you are effectively controlling the top-level structure of the execution plan, blocking the optimiser from having any influence over it. If you keep the individual steps small and simple then you can pretty much guarantee there is only one, sensible way for the optimiser to execute them.
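A sketch of that approach in SQL Server's dialect (select ... into creates the temporary table on the fly; other vendors use create temporary table instead), reusing the stages from the earlier example:

-- stage 1: materialise recent transactions into a temp table
select vendor_id,
       left(amount, 1) as first_digit,
       amount
into #recent_trans
from trans
where entered_date >= @start_date;

-- stage 2: a separate statement, so it gets its own simple plan
select vendor_id,
       first_digit,
       count(*) as frequency
into #first_digit_frequency
from #recent_trans
where amount >= 0
group by vendor_id, first_digit;

Each statement is now planned and executed on its own, so the optimiser has no opportunity to flatten or reorder the stages.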
This is not supposed to be necessary, and the ANSI standard barely acknowledges the need. But it is necessary, and that's why every major RDBMS vendor has added its own flavour of temporary table.
See what I mean? You can get SQL to do what you want, as long as you don't do what you're supposed to.
ACID hallucinations
In SQL you are encouraged to enforce consistency inside the database. Per SQL, the right way to design systems is to make them brutally fussy, and to refuse to accept wrong inputs, failing with a hard error. If the caller is using the database incorrectly, that's their problem. The invariants must be preserved. This idea is attractive because if you can reduce the possible states your database might be in, you don't have to write so much code (and so many tests) to deal with all those impossible states.
The range of constraints you can express at the DB level is quite limited: column values can be required to be unique, or a foreign key, or meet some simple check. But this is usually quite enough to tie your database down to an extremely inflexible usage pattern.
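That whole vocabulary fits into a few lines of DDL. Here is a sketch using an invented payment table (and assuming a vendor table already exists):

create table payment (
    payment_id   int primary key,
    vendor_id    int not null,
    reference_no varchar(50) unique,                 -- unique
    amount       decimal(19, 4) check (amount <> 0), -- simple check
    constraint fk_payment_vendor                     -- foreign key
        foreign key (vendor_id) references vendor (vendor_id)
);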
The problem is, in reality, we often don't know in advance exactly what these constraints ought to be. Theoretically a Payment is made to a Vendor, and so the vendor must exist in the DB before we can store the payment that refers to it. Except…
Sometimes the information arrives in the wrong order, so we need to store the payment now, even though we don't yet have the details for the vendor. In the meantime we could easily tolerate the existence of such orphaned payments; they shouldn't cause errors. They can just be ignored until the vendor record arrives later. Surprising as it is to some, there's no absolute rule saying we must have foreign key constraints for every relationship. Sometimes it would create an entirely artificial and unnecessary limitation.
To work flexibly, the relationship between the two record types will have to be expressed by a natural key: some external real-world identifier for the vendor. (How often do systems built on SQL make up surrogate IDs in tables that don't need them, and thus make things worse? Probably a lot.) Also, by querying for payments to vendors using an inner join, it's very easy to keep incomplete data out of sight of your code. Nevertheless, this more ad hoc approach to consistency is often overlooked on the assumption that the "proper" way to do SQL is to plaster your schema with rigid constraints. I'd argue that the proper approach is to solve the actual problem at hand, rather than solving some other, only superficially similar, problem.
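A sketch of what that looks like, with invented payment and vendor tables related by a natural key (vendor_ref, some real-world identifier) and no foreign key constraint:

-- the inner join quietly excludes payments whose vendor hasn't arrived yet
select p.vendor_ref, p.amount
from payment as p
join vendor as v
    on v.vendor_ref = p.vendor_ref;

-- and when you *do* want to see the orphans, a left join finds them
select p.vendor_ref, p.amount
from payment as p
left join vendor as v
    on v.vendor_ref = p.vendor_ref
where v.vendor_ref is null;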
I've worked with teams that have gone for years putting up with the inflexibility of their design because it had foreign key constraints that enforced a particular order to the insertion of data, and they believed (through a kind of psychological conditioning) that if they removed the constraints then something bad would happen. In reality, nothing bad would happen and their design would start behaving in a way that would be immediately better for their users.
By the way, SQL's built-in ACID guarantees are another example of how an abstraction can fall apart when faced with reality. Ultimately the RDBMS has to decide when to lock records, or even whole tables, so that other concurrent queries wait their turn. It does this for you, intelligently. Does this ring a bell? The marketing would suggest that this is another one of those technical details best left to the clever machine to figure out for you from your high-level constraints. You can guess how it works in practice: the locking approach differs between RDBMS vendors, and they give you ways of controlling it that are not covered by the standard.
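To give a flavour of those extra dials (each line is a different vendor's way in, and none of them mean quite the same thing; the vendor_id filter is just a placeholder):

-- SQL Server: a table hint that reads without taking shared locks
select * from trans with (nolock);

-- MySQL / PostgreSQL / Oracle: lock the rows a query reads
select * from trans where vendor_id = 42 for update;

-- nominally standard, but the actual behaviour still varies by vendor
set transaction isolation level read committed;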
Code generation, a.k.a. build your own high-level language
An important technique in creating maintainable systems is code generation, that is, writing your own tools to generate all the repetitive garbage code so you don't have to hand-maintain it. For me a major sign of the limitations of a language is the frequency with which code generation turns out to be useful. It's occasionally useful with any language, however well-equipped.
But if you find it seems to always be useful with your language, then that suggests that the language is painfully lacking in vital features. I seem to find code generation a continual necessity in SQL. Stack Overflow is full of suggestions involving exec (SQL Server) or prepare/execute (MySQL) or execute immediate (Oracle) as workarounds. If I want to do something very similar with each of a dozen columns, or I want to do almost the same thing in five different situations with only minor differences, there is generally no way to use a native SQL looping structure or to parameterise my code, except by generating the SQL externally.
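For example, doing the same thing to every column of a table ends up as string-building. This sketch uses SQL Server's catalog views and exec, and is the sort of thing those Stack Overflow answers suggest:

declare @sql nvarchar(max) = N'';

-- build one near-identical statement per column of trans
select @sql = @sql +
    'select ''' + c.name + ''' as column_name, '
    + 'count(distinct ' + quotename(c.name) + ') as distinct_values '
    + 'from trans; '
from sys.columns as c
where c.object_id = object_id('trans');

exec (@sql);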
I'm not sure it could get more ironic than a language adhering to a philosophy of providing purely high-level declarative features so you don't have to write ugly low-level imperative code, but which:
- does such a bad job of the translation that you have to restrict your use of it to the point where you are effectively writing low-level imperative code, and
- provides such a weak set of high-level declarative features that you have to use it as a kind of compilation target for your own language.
What's actually going on here?
Under the hood of a database, we find standard data structures, and the most important is the B-Tree, which is an excellent way of sorting large quantities of data by some key, and efficiently distributing it across separate pages that can be read/written as contiguous chunks from the underlying file. It makes it fast to seek to a particular key, and then to scan through adjacent keys in their sorted order. Ultimately the purpose of a SQL database is to make the low-level power of a B-Tree available to you in a convenient high-level form. The primary key of a table serves as the sort key for a B-Tree that holds all that table's records. Additional indexes on the table's columns are just further B-Trees that map to the primary key of the target record.
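As a concrete sketch (the column list is my guess at the trans table used earlier; clustering on the primary key is the default in SQL Server and MySQL's InnoDB, whereas PostgreSQL keeps rows in a heap, so the picture differs slightly there):

-- the primary key becomes the sort key of the table's main B-Tree
create table trans (
    trans_id     bigint primary key,
    vendor_id    int,
    amount       decimal(19, 4),
    entered_date date
);

-- a secondary index: another B-Tree, sorted by entered_date,
-- whose entries lead back to the row identified by trans_id
create index ix_trans_entered_date on trans (entered_date);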
How do B-Trees make queries fast? Mainly by keeping data with logically adjacent keys close together. On a very large dataset the slowest thing about querying is reading the data from disk. To seek to a random location averages about 10 ms, which is long enough to read a megabyte sequentially. Seeking is obviously a huge time suck. It could be reduced by keeping data physically sorted on disk, but that would make insertion horribly expensive (imagine having to copy gigabytes of data along by some offset to make room for a new row). A B-Tree compromises by bunching data together into pages, and each page is read as a whole unit. So there is only a need for an expensive seek each time you scan off the end of the current page. The data doesn't have to be precisely sorted within the pages, but it must be sorted between the pages (that is, it must be partitioned).
This understanding is vital to obtaining acceptable, predictable performance. So what does SQL do? It hides these truths from you as much as possible. They are dirty, low-level thoughts and are beneath you. Roll up, one and all, come and play performance roulette!
But by defining indexes, and only letting the SQL optimiser loose on very simple queries, you can effectively use an RDBMS as a raw B-Tree engine: a means of storing and caching partitioned data so you can choose how to perform lookups and scans within it.
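In that mode of use, the queries you allow yourself map directly onto B-Tree operations. A sketch, using the table and index from the previous section (the dates are placeholders):

-- a single seek down the primary key's B-Tree
select trans_id, vendor_id, amount
from trans
where trans_id = 1234567;

-- a seek followed by a scan along adjacent pages of the secondary index
select trans_id, vendor_id, amount
from trans
where entered_date >= '2024-01-01'
  and entered_date <  '2024-02-01'
order by entered_date;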
On the other hand, you may find yourself fighting against it more often than you are supported by it. The seductive possibilities of complex queries are always dragging you towards the treacherous cliff-edge of unpredictable performance.
So this is why people sometimes invent their own database system from scratch.