Scalability and the Relational Model

The relational model is just a way to represent reality. It happens to have some very useful properties, such as closure over many useful operations — but it’s a purely logical model of reality. You can implement relational operations using hash joins, MapReduce, or pen and paper.

So, right away, it’s meaningless to talk about the scalability of the relational model. Given a particular question, it might be difficult to break it down into bite-sized pieces and distribute it to N worker nodes. But going with MapReduce doesn’t solve that scalability problem — it just means that you will have a hard time defining a useful map or reduce operation, or you will have to change the question into something easier to answer.

Continue reading

Good Error Recovery is Hard, So Use PostgreSQL

My last post about the Linux OOM Killer got a lot of attention, and I don’t intend to re-open the whole discussion. However, a couple themes developed that I think warrant some further discussion. One of those themes is that “error recovery is hard,” and that’s absolutely true. It’s so true that many of the kernel developers seemed to develop a more extreme version: “error recovery is infeasible” (see this post) — but that is not true.

PostgreSQL is a great example: it gets error recovery right. I didn’t say it’s perfect, but graceful degradation is designed into postgres from the start, and if it doesn’t degrade gracefully, that’s probably a bug. I am saying this as a person who experienced a not-so-graceful PostgreSQL problem first-hand back in 2006 that was due to an OOM condition on FreeBSD (and yes, malloc() returned NULL). But:

  • The bug occurred under fairly strange circumstances where very small allocations ate the memory very completely, leaving no room (not even a few dozen bytes), during a transaction that included extensive DDL (note that most database systems can’t even do transactional DDL at all).
  • The bug was fixed!

A database system getting error recovery right is no accident, fluke, or statistical outlier. Database systems are meant to get this right so that applications do not have to do it. In other words, write applications against PostgreSQL (or another robust relational database system), and you get error recovery for free.

The application may be performing a complex series of updates or state changes, and trying to ensure graceful error recovery without a database system to help may be nearly impossible. But by simply wrapping those operations in a single transaction, the application only needs to know how to handle a few classes of errors, and will always see consistent data afterward.

Moral of the story: if you’re not building on top of a robust DBMS, and you have any significant amount of state to manage, then you are not getting error recovery right (unless you put in the serious analysis required). Similar problems apply to those using a DBMS like a filesystem, with no constraints (to prevent corruption) and related operations split across transactional boundaries. So: use PostgreSQL, use transactions, and use constraints.

Temporal Keys, Part 2

In the last article, I argued that:

  • A schedule conflict is a typical business problem.
  • The later you try to resolve a schedule conflict, the more costly it is to resolve.
  • In particular, there is a big jump in the cost the moment after conflicting data is recorded.
  • Therefore, it’s best for the DBMS itself to enforce the constraint, because only the DBMS can avoid the conflict effectively before the conflict is recorded.

Then, I opened up a discussion to see how people are dealing with these schedule conflicts. In the comments I received at the end of the article, as well as other anecdotes from conferences, user groups, mailing lists, and my own experience, the solutions fall into a few categories:

Continue reading

Temporal Keys, Part 1

“Schedule conflict” — it’s one of the simplest and most common constraints for business or any other organization. One person cannot be in two places at the same time; and in many cases a only a single person can use a given resource at any time.

Does your database system know anything about a schedule conflict? Should it?

Constraints are always enforced at some point, it’s just a matter of when, how, and the cost of correcting the situation.

Continue reading

PostgreSQL WEST and Temporal Databases

I’ve been interested in temporal data and relational databases for quite some time. There are going to be at least two people talking about temporal data at PostgreSQL WEST in Seattle: Scott Bailey and me. See the talk descriptions.

In the past, I’ve worked on a temporal extension to PostgreSQL that implements the PERIOD data type. This is a data type that offers both a definite beginning and a definite end time, which is important for describing things that happen over a period of time, rather than instantaneously. Trying to use separate attributes for “start” and “end” is bad for a number of reasons, and will certainly be addressed in a subsequent blog entry. For now, I’ll just say that I believe the PERIOD data type is fundamentally important for handling all kinds of time data, which I believe is a common problem.

At WEST, I’ll be presenting my progress on temporal keys. Temporal keys are used to prevent overlapping periods of time — a schedule conflict — by using an index and following the same concurrent behavior as UNIQUE with minimal performance cost (one extra index search, to be precise).

Temporal keys cannot be expressed in PostgreSQL 8.4, unless you resort to triggers and a full table lock (ouch!). So, additional backend support is required. This is accomplished in my patch for operator exclusion constraints, which are a more general way of using arbitrary operators and index searches to enforce a constraint. I plan to do what’s required for the patch to be accepted in PostgreSQL 8.5.

Temporal modeling is a common problem. It seems like almost every PostgreSQL conference has had at least one talk on the matter, so we know there is some demand for improvement. If you’re interested, I hope you come to WEST and chat with Scott or I, and let’s see if we can come up with some real solutions.

None, nil, Nothing, undef, NA, and SQL NULL

In my last post, Why DBMSs are so complex, I raised the issue of type mismatches between the application language and the DBMS.

Type matching between the DBMS and the application is as important as types themselves for successful application development. If a type behaves one way in the DBMS, and a “similar” type behaves slightly differently in the application, that can only cause confusion. And it’s a source of unnecessary awkwardness: you already need to define the types that suit your business best in one place, why do you need to redefine them somewhere else, based on a different basic type system?

Continue reading