Choosing a database can be confusing, but there are only a few families you actually care about. This episode gives you a map to the world of today's important databases.
Choosing a database for your project can be overwhelming. In the 1990's it was easy, but since then we've seen a Cambrian explosion of databases. But like the creatures that evolved in the Cambrian era, they fall into a much smaller number of families that describe their overall structure. Once you know those families, it's much easier to remember how a particular member's history led it in a different direction than its close kin.
Computers have provided structured forms of storing and retrieving data as far back as the 1950's. On many mainframe systems, a file isn't a sequence of unstructured bytes. It's a table of records with a fixed set of columns. Through the 1960's and 1970's many varieties of database arose, each trumpeted by their adherents as the best way to handle your data.
The key event in the history of databases happened in 1970. A researcher at IBM named Edgar Codd published a paper "A Relational Model of Data for Large Shared Data Banks." Codd was a mathematician and scientist first, programmer second, and his approach to databases reflects that. He asked the question: what is the general mathematical underpinning for modeling data?
Initially IBM wouldn't implement his work because it would cannibalize revenue from their existing, non-relational database, but in the late 1970's they took the plunge. They replaced Codd's very mathematical query language with the output of a project on natural-language-style programming, which became SQL. Larry Ellison copied it to implement his relational database, Oracle, and a standard was born.
Another time I want to dig into the mental structures implied by the relational model in detail, but today we're focused on the landscape of systems that emerged.
Anyway, the 1980's saw the relational database almost completely eat the database world. It was a shift as drastic as structured programming. Through the 1990's, non-relational databases were a niche idea in obscure communities. The 1990's are when the three relational databases that really matter today were born: Microsoft SQL Server, MySQL, and PostgreSQL.
There are three other relational databases today that are worth a remark before we dig into these big three.
First, SQLite. SQLite is an embedded database, that is, a library you use inside your program, not a stand-alone server you connect to. It stores all its data in one file. It has deliberately traded away highly concurrent access, roughly one writer at a time, in favor of simplicity. It is an amazing tool and one you should have in your back pocket if you need to store structured data for a local project or desktop application.
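To make "embedded" concrete, here's a minimal sketch. The statements are ordinary SQL; what's different about SQLite is that they run in-process through the library (or the bundled sqlite3 shell), and the entire database is the one file you opened (the file name notes.db here is just an example).

    -- Everything below lives in a single file, e.g. notes.db, opened
    -- directly by the library inside your program.
    CREATE TABLE notes (
        id    INTEGER PRIMARY KEY,  -- SQLite assigns ids automatically
        title TEXT NOT NULL,
        body  TEXT
    );

    INSERT INTO notes (title, body)
    VALUES ('first', 'hello from an embedded database');

    SELECT id, title FROM notes ORDER BY id;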
Second, Oracle. Larry Ellison's Oracle was one of the very first commercial relational databases. Today no one buys Oracle directly. Oracle-the-company sells lots of integrated line-of-business software, much of which uses Oracle-the-database under the hood, and that is where Oracle gets sold. There are two reasons for this. First, Oracle has generally had worse performance than its competitors. Second, no one wants to deal with Oracle. This is the company that sued anyone who tried to publish benchmarks on how well their database performed. There's a standing joke in the industry that Oracle is an acronym for One Raging Asshole Called Larry Ellison.
Finally, IBM DB2 is ubiquitous on IBM mainframes. It's a mature relational database and can handle impressive workloads with good performance, but unless you're in the IBM mainframe ecosystem it's irrelevant.
So, back to the big three we mostly care about today: SQL Server, MySQL, and PostgreSQL.
SQL Server is commercial software from Microsoft. It emerged from a project at Microsoft to port Sybase, an earlier relational database, to OS/2, and has continued from there. In the 1990's it was largely restricted to Microsoft shops, but it has since been ported to Linux, and, at least for a while, it was actually faster on Linux. Microsoft implemented a lot of features in SQL Server either before they were standardized elsewhere or back when Microsoft was at the height of its power and preferred not to be compatible, so many of the more advanced features of SQL Server's SQL dialect look quite different from those of other systems.
PostgreSQL was a rebuild of the earlier Postgres database (which didn't use SQL). Postgres was in turn a rebuild of the earlier Ingres database to support modern features like extensible data types. Ingres was also the ancestor of Sybase, which SQL Server is based on. This really is a very small family tree. PostgreSQL is certainly the most flexible and extensible database available today. You can replace the storage of individual tables, including having them be frontends to other programs entirely, and you can define arbitrary types and extensions of your own.
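As a sketch of that extensibility, and assuming a PostgreSQL server with the bundled postgres_fdw extension available, you can create a "foreign table" that is really a frontend to a table living on another server, and define your own types. The server, table, and credential names below are hypothetical.

    -- A foreign table: it looks local, but every query is forwarded to
    -- another PostgreSQL server. Names and credentials are made up.
    CREATE EXTENSION postgres_fdw;

    CREATE SERVER reporting_db
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'reports.internal', dbname 'reports');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER reporting_db
        OPTIONS (user 'app', password 'secret');

    CREATE FOREIGN TABLE daily_totals (
        day   date,
        total numeric
    ) SERVER reporting_db;

    -- A custom type, usable in tables and queries like any built-in one.
    CREATE TYPE mood AS ENUM ('happy', 'meh', 'grumpy');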
MySQL also comes from Postgres, kind of. Postgres was too resource hungry for very low-end machines in the early 1990's, so someone created a lightweight alternative called mSQL with just enough SQL for their purposes. Someone else then built a drop-in replacement for mSQL with broader SQL support, and that became MySQL.
MySQL and PostgreSQL are best understood in comparison with one another, especially in their earlier versions. Early PostgreSQL focused on correctness and flexibility. Early MySQL focused on speed for simple queries. Indeed, in the 1990's MySQL was notorious for corrupting data under load the way MongoDB is today; PostgreSQL's storage engine had no such reputation. MySQL was faster than PostgreSQL on simple SELECT queries, but as soon as joins entered the picture, PostgreSQL was faster. MySQL lets you choose a storage engine table by table, and for years which one you picked determined whether you even got transactions; PostgreSQL's single built-in storage engine behaves the same everywhere. And perhaps most telling of the philosophical differences between the two projects: if a query is slow in MySQL because the query planner has chosen poorly, the database provides hints to force the planner in certain directions. In PostgreSQL, if the query planner makes a poor choice, you report it as a bug. There are no hints to provide.
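To make that philosophical difference concrete, here's roughly what a MySQL index hint looks like; the table and index names are invented. PostgreSQL has no equivalent syntax, which is exactly the point.

    -- MySQL: nudge or force the planner toward a particular index.
    -- Table and index names are hypothetical.
    SELECT *
    FROM orders USE INDEX (idx_orders_customer)
    WHERE customer_id = 42;

    SELECT *
    FROM orders FORCE INDEX (idx_orders_customer)
    WHERE customer_id = 42 AND placed_at > '2024-01-01';

    -- PostgreSQL: there is no hint to write; the query is just the query.
    SELECT * FROM orders WHERE customer_id = 42;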
Today, neither has data corruption issues, and since MySQL 8 its query planner mostly doesn't need the hints anymore. For a while MySQL's support for replicating to additional instances of the database was more mature, but PostgreSQL has largely caught up. PostgreSQL still has a vastly more flexible and powerful feature set, so it should be your default choice.
Now we move on to NoSQL systems. There was a lot of Sturm und Drang in the early 2000's about how relational databases were obsolete and we were all going to move to NoSQL systems. The problem is that most of the NoSQL systems people were pushing were reimplementations of database designs that had already been abandoned when relational databases were invented. Alternatively, some people claimed their NoSQL systems were much simpler than relational systems. In practice this meant they lacked features that relational systems had developed by necessity over the previous decades. Most of those systems have spent the last two decades poorly reinventing those same features and becoming less simple. What some NoSQL stores did provide was running as a distributed system from the ground up.
Distributed databases weren't new. Teradata had been doing the same thing with a relational database. But Teradata and the other distributed databases up to that point had been focused on analytics workloads, that is, situations where the data was changing slowly if at all and the problem was scaling queries across data sets that would not fit on one machine.
This is an important distinction. All the databases we've discussed so far are focused on OLTP, which stands for Online Transaction Processing: queries regularly update the data set at the same time as it is being read. There is a whole parallel path of evolution for OLAP, or Online Analytical Processing. Most OLAP work is handled just fine by a normal database, but the rise of giant web services like Google and Facebook in the 2000's and the fad for big data in other businesses drove the creation of specialized OLAP systems like Google's MapReduce and Apache Hadoop. The trend among all of these has been that these systems develop, and then someone produces a SQL interface to them. SQL remains the lingua franca of the data analysis world. Today there are a bunch of systems in this space, but I'm not going to explore them in this episode.
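Roughly, the two workloads look like this in SQL. A hedged illustration with made-up table names: OLTP is lots of small concurrent reads and writes, OLAP is a few large read-only scans.

    -- OLTP: many small statements touching a handful of rows, all day long.
    UPDATE accounts SET balance = balance - 25.00 WHERE id = 1001;
    UPDATE accounts SET balance = balance + 25.00 WHERE id = 2002;

    -- OLAP: a few big read-only queries sweeping across huge swaths of data.
    SELECT region, sum(total) AS revenue
    FROM orders
    WHERE placed_at >= '2024-01-01'
    GROUP BY region
    ORDER BY revenue DESC;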
As an aside, OLTP and OLAP are an example of a pet peeve of mine. If you're going to name two things to distinguish them, give them really different names. The same is true of authentication and authorization in security. Everyone shortens them both to auth, but which one do you mean? If we had called them login and permissions we wouldn't be confused. Fortunately most people don't have occasion to use the terms OLTP or OLAP; they have one system of each kind and refer to it by name.
Coming back to OLTP systems, there are three classes of NoSQL systems that we still care about today: graph databases, document databases, and distributed key-value stores.
Graph databases have been around since the beginning. Until SQL recently grew recursive capabilities, it was simply unable to do things like trace a path through a graph. So the people that needed graph databases kept them alive as a niche community even during the dominance of relational systems. Now that SQL has developed recursion we'll see what happens in the next few years. PostgreSQL has a history of adding a few features and suddenly eating a category of database.
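That recursive capability is the WITH RECURSIVE construct from the SQL standard (PostgreSQL, MySQL 8, and SQLite all support it). A minimal sketch with a hypothetical edge-list table, walking outward from one node:

    -- Find every node reachable from node 1 in a graph stored as edges.
    -- Table and column names are hypothetical; the depth cap keeps
    -- cycles in the graph from recursing forever.
    WITH RECURSIVE reachable(node, depth) AS (
        SELECT 1, 0
      UNION ALL
        SELECT e.dst, r.depth + 1
        FROM edges e
        JOIN reachable r ON e.src = r.node
        WHERE r.depth < 10
    )
    SELECT DISTINCT node FROM reachable;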
Document databases store structured data with variable schemas. Think: a JSON document. Indeed, MongoDB explicitly began as a store for JSON documents. A lot of these grew from systems that stored logs, such as Splunk or Elasticsearch. Which fields a particular log entry had varied, but you still wanted to be able to query over them. These systems are extremely useful, especially the modern ones built around tracing like Honeycomb and Datadog.
The other path that led to document databases was people trained in object-oriented programming who never learned the relational model, and felt contempt for what they did not understand. That gave us systems like MongoDB. MongoDB had one great virtue in its early years: it was the easiest way you could get to set up replication from a primary instance to a set of read-only instances. That virtue was rather blemished by its track record of corrupting the data it replicated and losing data on disk. It has spent the last decade trying to patch up the holes and figuring out how to wedge the affordances that Codd built into the relational model into its own system. If you have to run MongoDB, the best way to do it is to use Azure's CosmosDB with its MongoDB emulation layer. PostgreSQL also has a native document type which pretty much subsumes MongoDB without the data integrity issues.
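That native type is jsonb. A minimal sketch with a hypothetical events table: you can store free-form documents, query inside them, and index them, right next to your ordinary relational tables.

    -- Schemaless documents stored in an ordinary PostgreSQL table.
    CREATE TABLE events (
        id  bigserial PRIMARY KEY,
        doc jsonb NOT NULL
    );

    INSERT INTO events (doc)
    VALUES ('{"type": "login", "user": "ada", "ip": "10.0.0.7"}');

    -- Query inside the documents...
    SELECT doc->>'user' AS who
    FROM events
    WHERE doc @> '{"type": "login"}';

    -- ...and index them so containment queries like the one above stay fast.
    CREATE INDEX events_doc_idx ON events USING gin (doc);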
The final class of NoSQL systems we care about are distributed key-value stores like DynamoDB from Amazon or Cassandra. These were systems designed to run as resilient distributed systems with predictable performance on large collections of machines. They provide a little more than a simple key-value store, but not much.
There's a reason for that: at the time they were developed, we simply did not know how to implement the higher level features of a relational database like transactions in a way that would scale smoothly with predictable high performance. These were built for the specific needs of huge systems like AWS and Facebook where they could afford to throw man-hours at working around the missing relational features because the alternative was not having a system at all.
It took some years to start solving those problems and building a new set of distributed relational systems. Then, because database folks tend to be very conservative, most of us sat and watched them mature for a decade or so. For data you care about, it's always better to let someone else be the case study, and there never seems to be a shortage of people willing to step up and do so. Anyway, these new databases got the moniker of NewSQL.
The two big threads of NewSQL systems were those that imitated Google's Spanner, which needed precise time synchronization among all the nodes of the system, and those that imitated the Calvin system, which did not require such precise timing. Calvin's bet was that the time synchronization requirement was a nonstarter for most people; Google had installed a network of atomic clocks to meet it. What actually happened is that more and more people ran their workloads in clouds like AWS, which were happy to put precise time synchronization into their giant data centers, and a new generation of software-based time synchronization techniques emerged that make Spanner-like systems feasible even without the atomic clocks.
With the time synchronization, Spanner's a better bet. The only Calvin derived system in general use that I'm aware of is FaunaDB, and they have pivoted to trying to provide high level backends for app developers. The other four important NewSQL systems—Google's F1, CockroachDB, YugabyteDB, and TiDB—are all based on Spanner. F1 is actually a layer on top of Spanner and only available in Google Cloud. If you're not in Google Cloud, then it's irrelevant to you. If you are, it's a good choice.
So how are the other three different?
CockroachDB began by focusing on correctness and clean design at the cost of a smaller feature set. They wrote their own query parser and query planner. They began by using the open source RocksDB storage engine for handling data on disk, but ended up reimplementing the pieces they needed, which removed a lot of complexity and corner cases from their codebase.
YugabyteDB made a different bet: they support the entirety of PostgreSQL's large feature set, because they took PostgreSQL and replaced its storage engine with a distributed interface to RocksDB, the same engine Cockroach began with. This gave them a huge feature set immediately, but every change they make has to tiptoe around how the PostgreSQL codebase, their coordinator codebase, and the RocksDB codebase interact. It was a good bet short-term, but I don't like it long-term.
TiDB is somewhere in between. It uses RocksDB under the hood, and etcd to handle the problems of leader election and cluster state, but implements its own query processing. TiDB has one more very interesting property: it has two storage engines, one for OLTP and one for OLAP. You can run the different engines on different nodes and control how data is replicated between them, giving you a single system for both your transactional and analytical workloads.
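As a sketch of how that looks (going from TiDB's documented syntax, with a made-up table name, so treat the details as approximate): you tell TiDB to keep columnar replicas of a table on its analytical engine, TiFlash, and the optimizer can then route analytical queries there while transactional traffic keeps hitting the row store, TiKV.

    -- Ask TiDB to maintain two columnar (TiFlash) replicas of this table;
    -- the row-store (TiKV) copies keep serving transactional traffic.
    ALTER TABLE orders SET TIFLASH REPLICA 2;

    -- An analytical query the optimizer can now send to the columnar replicas.
    SELECT customer_id, SUM(total) AS lifetime_value
    FROM orders
    GROUP BY customer_id;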
Of the three, TiDB and YugabyteDB provide their full versions as open source software. CockroachDB provides a subset of the system as open source. The full one is commercial.
So, let's recap what the landscape looks like today. This will be a massive simplification. There are still massive Oracle and DB2 installations. SQLite is embedded in Apple's frameworks and does heavy lifting on every macOS and iOS device on the planet. But if you're sitting down to build a system that needs to have various people and programs connecting to the data store, this will be a good enough map.
For most purposes, a classical relational database is still your best option. Unless you already have a lot of expertise and cost sunk into SQL Server or MySQL, you should probably start with PostgreSQL.
If you have a problem where you are linking and tracing through graphs of entities, a graph database like Neo4j is still the way to go, but keep an eye out as the recursive and graph processing capabilities of the relational systems mature.
Document databases like Honeycomb are excellent for searching tracing data captured from your system monitoring. A document database should not be your primary data store. Take the time to learn the relational model.
If you need a distributed database today and your organization doesn't already have a lot of sunk expertise and workload on DynamoDB or Cassandra, look at the NewSQL systems. If you're willing to pay ongoing licensing fees, consider Cockroach. If not, look at TiDB.
But for 90% of your cases, you probably just want PostgreSQL.