Good Error Recovery is Hard, So Use PostgreSQL

My last post about the Linux OOM Killer got a lot of attention, and I don’t intend to re-open the whole discussion. However, a couple of themes developed that I think warrant further discussion. One of those themes is that “error recovery is hard,” and that’s absolutely true. It’s so true that many of the kernel developers seemed to adopt a more extreme version: “error recovery is infeasible” (see this post). That more extreme version, however, is not true.

PostgreSQL is a great example: it gets error recovery right. I’m not saying it’s perfect, but graceful degradation is designed into PostgreSQL from the start, and if it doesn’t degrade gracefully, that’s probably a bug. I am saying this as a person who experienced a not-so-graceful PostgreSQL problem first-hand back in 2006 that was due to an OOM condition on FreeBSD (and yes, malloc() returned NULL). But:

  • The bug occurred under fairly unusual circumstances: very small allocations consumed memory so completely that not even a few dozen bytes remained, and this happened during a transaction that included extensive DDL (note that most database systems can’t do transactional DDL at all).
  • The bug was fixed!

A database system getting error recovery right is no accident, fluke, or statistical outlier. Database systems are meant to get this right so that applications do not have to do it. In other words, write applications against PostgreSQL (or another robust relational database system), and you get error recovery for free.

The application may be performing a complex series of updates or state changes, and trying to ensure graceful error recovery without a database system to help may be nearly impossible. But by simply wrapping those operations in a single transaction, the application only needs to know how to handle a few classes of errors, and will always see consistent data afterward.
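As a minimal sketch of that idea, the example below wraps two related updates in one transaction. It uses Python’s built-in sqlite3 module purely as a stand-in so that it runs without a server; the same pattern applies to PostgreSQL through a driver. The accounts table and the transfer() and balances() helpers are hypothetical names invented for illustration.

```python
import sqlite3

# Hypothetical schema: a tiny accounts table in an in-memory database.
conn = sqlite3.connect(":memory:")
with conn:
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )


def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit; return True if it committed."""
    try:
        with conn:  # commits on normal exit, rolls back if an exception escapes
            for delta, name in ((-amount, src), (amount, dst)):
                cur = conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (delta, name),
                )
                if cur.rowcount != 1:
                    raise LookupError(f"no such account: {name}")
    except (sqlite3.Error, LookupError):
        return False  # any update already applied was rolled back
    return True


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling transfer(conn, "alice", "carol", 30) debits alice, fails to find carol, and returns False. Because the debit was part of the same transaction, it is undone, and the application only has to handle "it worked" or "it did not."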

Moral of the story: if you’re not building on top of a robust DBMS, and you have any significant amount of state to manage, then you are not getting error recovery right (unless you put in the serious analysis required). Similar problems apply to those using a DBMS like a filesystem, with no constraints (to prevent corruption) and related operations split across transactional boundaries. So: use PostgreSQL, use transactions, and use constraints.
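To illustrate the constraints part of that advice, here is a small sketch in which a CHECK constraint rejects a bad state and the enclosing transaction discards the related write that came before it. Again, sqlite3 is only a runnable stand-in for PostgreSQL, and the stock and orders tables and the place_order() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:
    # The CHECK constraint makes negative stock impossible to store.
    conn.execute(
        "CREATE TABLE stock ("
        " item TEXT PRIMARY KEY,"
        " quantity INTEGER NOT NULL CHECK (quantity >= 0))"
    )
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " item TEXT NOT NULL,"
        " count INTEGER NOT NULL CHECK (count > 0))"
    )
    conn.execute("INSERT INTO stock VALUES ('widget', 5)")


def place_order(conn, item, count):
    """Record an order and reduce stock together; return True on commit."""
    try:
        with conn:  # both statements commit together or not at all
            conn.execute(
                "INSERT INTO orders (item, count) VALUES (?, ?)", (item, count)
            )
            cur = conn.execute(
                "UPDATE stock SET quantity = quantity - ? WHERE item = ?",
                (count, item),
            )
            if cur.rowcount != 1:
                raise LookupError(f"unknown item: {item}")
    except (sqlite3.IntegrityError, LookupError):
        return False  # the order row inserted above is rolled back too
    return True
```

If the two statements ran in separate transactions, a rejected stock update would leave an order row behind with nothing backing it. Keeping them in one transaction, with the constraint as the guard, means the application never sees that half-finished state.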

Linux OOM Killer

The Linux OOM Killer heuristic can be summed up as:

  1. Run out of memory.
  2. Kill PostgreSQL.
  3. Look for processes that might be using too much memory, and kill them, hopefully freeing memory.

Notice that step #2 is independent of factors like:

  • Is PostgreSQL consuming a significant share of the memory resources?
  • Will killing PostgreSQL alleviate any memory pressure at all?