Good Error Recovery is Hard, So Use PostgreSQL

My last post about the Linux OOM Killer got a lot of attention, and I don’t intend to re-open the whole discussion. However, a couple themes developed that I think warrant some further discussion. One of those themes is that “error recovery is hard,” and that’s absolutely true. It’s so true that many of the kernel developers seemed to develop a more extreme version: “error recovery is infeasible” (see this post) — but that is not true.

PostgreSQL is a great example: it gets error recovery right. I didn’t say it’s perfect, but graceful degradation is designed into postgres from the start, and if it doesn’t degrade gracefully, that’s probably a bug. I am saying this as a person who experienced a not-so-graceful PostgreSQL problem first-hand back in 2006 that was due to an OOM condition on FreeBSD (and yes, malloc() returned NULL). But:

The bug occurred under fairly strange circumstances where very small allocations ate the memory very completely, leaving no room (not even a few dozen bytes), during a transaction that included extensive DDL (note that most database systems can’t even do transactional DDL at all).
The bug was fixed!

A database system getting error recovery right is no accident, fluke, or statistical outlier. Database systems are meant to get this right so that applications do not have to do it. In other words, write applications against PostgreSQL (or another robust relational database system), and you get error recovery for free.

The application may be performing a complex series of updates or state changes, and trying to ensure graceful error recovery without a database system to help may be nearly impossible. But by simply wrapping those operations in a single transaction, the application only needs to know how to handle a few classes of errors, and will always see consistent data afterward.

Moral of the story: if you’re not building on top of a robust DBMS, and you have any significant amount of state to manage, then you are not getting error recovery right (unless you put in the serious analysis required). Similar problems apply to those using a DBMS like a filesystem, with no constraints (to prevent corruption) and related operations split across transactional boundaries. So: use PostgreSQL, use transactions, and use constraints.

6 thoughts on “Good Error Recovery is Hard, So Use PostgreSQL”

@retard

hm, maybe Mr. Davis should enable comment moderation…

Raoul Duke on December 24, 2009 at 12:17 said:

oh now what i was referring to has disappeared, so i guess there is some editing going on.

re: the actual meat of the story: i used mysql before (probably like 3 years ago now) and at the time it really did seem to suck when it came to error recovery e.g. rollback. it amazes me that people want to use a rdbms and yet don’t seem to want it to be solid.
Jeff Davis on December 24, 2009 at 12:58 said:

Yes, I removed the offensive comment. Until now, offensive comments were not a problem. If it becomes a problem, I’ll enable moderation.

I totally agree with you on this one. I’ve been using Pgsql for a decade now and I’ve always loved the fact that it doesn’t die on me. Ever. Even if there’s a power outage during updates to the database, disk space runs out, OOM or whatever. Never had a single fault in the databases, even in the ones running tens of gigabytes of data and being updated while these conditions appear.

Naturally the servers are one thing, in production use running out of disk space or sudden power outage without memory backup in the disk controllers mustn’t happen (but they nevertheless do sometimes), but on the test systems these have happened so many times.

The only way I’ve gotten Pgsql to corrupt itself has been on OS X. fsync off, running long and hard updates and then having the OS or something crash and then it’s bye bye for that database. Dunno if the OS X port is not that good or if it’s just that HFS+ doesn’t really cut it. Fortunately I don’t have to use OS X, I can work with decent OSs

Jeff Davis on December 29, 2009 at 18:58 said:

“fsync = off” can (and probably will) cause corruption on any platform, not just OSX. It’s well-documented that it’s dangerous.

PostgreSQL + OSX should be fine.

I completely agree with you. I’m using PostgreSQL Maestro 10.12 . Here is the link http://bytesland.com/view/PostgreSQL-Maestro-10-12/3 from which I downloaded it. In the majority of cases it works almost perfectly.

Raoul Duke on December 24, 2009 at 12:16 said:

@retard

hm, maybe Mr. Davis should enable comment moderation…
- Raoul Duke on December 24, 2009 at 12:17 said:
  
  oh now what i was referring to has disappeared, so i guess there is some editing going on.
  
  re: the actual meat of the story: i used mysql before (probably like 3 years ago now) and at the time it really did seem to suck when it came to error recovery e.g. rollback. it amazes me that people want to use a rdbms and yet don’t seem to want it to be solid.
- Jeff Davis on December 24, 2009 at 12:58 said:
  
  Yes, I removed the offensive comment. Until now, offensive comments were not a problem. If it becomes a problem, I’ll enable moderation.
Symbiatch on December 27, 2009 at 21:06 said:

I totally agree with you on this one. I’ve been using Pgsql for a decade now and I’ve always loved the fact that it doesn’t die on me. Ever. Even if there’s a power outage during updates to the database, disk space runs out, OOM or whatever. Never had a single fault in the databases, even in the ones running tens of gigabytes of data and being updated while these conditions appear.

Naturally the servers are one thing, in production use running out of disk space or sudden power outage without memory backup in the disk controllers mustn’t happen (but they nevertheless do sometimes), but on the test systems these have happened so many times.

The only way I’ve gotten Pgsql to corrupt itself has been on OS X. fsync off, running long and hard updates and then having the OS or something crash and then it’s bye bye for that database. Dunno if the OS X port is not that good or if it’s just that HFS+ doesn’t really cut it. Fortunately I don’t have to use OS X, I can work with decent OSs
- Jeff Davis on December 29, 2009 at 18:58 said:
  
  “fsync = off” can (and probably will) cause corruption on any platform, not just OSX. It’s well-documented that it’s dangerous.
  
  PostgreSQL + OSX should be fine.
Nick on April 10, 2011 at 03:42 said:

I completely agree with you. I’m using PostgreSQL Maestro 10.12 . Here is the link http://bytesland.com/view/PostgreSQL-Maestro-10-12/3 from which I downloaded it. In the majority of cases it works almost perfectly.

Experimental Thoughts

Ideas on Databases, Logic, and Language by Jeff Davis

Good Error Recovery is Hard, So Use PostgreSQL

6 thoughts on “Good Error Recovery is Hard, So Use PostgreSQL”