Good Error Recovery is Hard, So Use PostgreSQL (December 23, 2009)

My last post about the Linux OOM Killer got a lot of attention, and I don’t intend to re-open the whole discussion. However, a couple of themes developed that I think warrant further discussion. One of those themes is that “error recovery is hard,” and that’s absolutely true. It’s so true that many of the kernel developers seemed to arrive at a more extreme version: “error recovery is infeasible” (see this post), but that is not true.

PostgreSQL is a great example: it gets error recovery right. I’m not saying it’s perfect, but graceful degradation has been designed into postgres from the start, and if it doesn’t degrade gracefully, that’s probably a bug. I say this as someone who experienced a not-so-graceful PostgreSQL problem first-hand back in 2006, caused by an OOM condition on FreeBSD (and yes, malloc() returned NULL). But:

  • The bug occurred under fairly strange circumstances, in which many very small allocations consumed memory so completely that not even a few dozen bytes remained, during a transaction that included extensive DDL (note that most database systems can’t do transactional DDL at all).
  • The bug was fixed!

A database system getting error recovery right is no accident, fluke, or statistical outlier. Database systems are meant to get this right so that applications do not have to do it. In other words, write applications against PostgreSQL (or another robust relational database system), and you get error recovery for free.

The application may be performing a complex series of updates or state changes, and trying to ensure graceful error recovery without a database system to help may be nearly impossible. But by simply wrapping those operations in a single transaction, the application only needs to know how to handle a few classes of errors, and will always see consistent data afterward.
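
As a concrete illustration, here is a minimal libpq sketch of that pattern; the connection string, the accounts table, and its columns are hypothetical, and the point is only that the related updates live inside one transaction, so a failure at any step leaves the data exactly as it was before the BEGIN.

    /* Sketch: wrap related updates in one transaction so that a failure at
     * any step leaves the database unchanged. The table, columns, and
     * connection string are hypothetical. Build with -lpq. */
    #include <stdio.h>
    #include <libpq-fe.h>

    static int exec_or_fail(PGconn *conn, const char *sql)
    {
        PGresult *res = PQexec(conn, sql);
        int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);
        if (!ok)
            fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
        PQclear(res);
        return ok;
    }

    int main(void)
    {
        PGconn *conn = PQconnectdb("dbname=test");
        if (PQstatus(conn) != CONNECTION_OK) {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        /* Either both updates take effect, or neither does. */
        if (exec_or_fail(conn, "BEGIN") &&
            exec_or_fail(conn, "UPDATE accounts SET balance = balance - 100 WHERE id = 1") &&
            exec_or_fail(conn, "UPDATE accounts SET balance = balance + 100 WHERE id = 2") &&
            exec_or_fail(conn, "COMMIT")) {
            printf("transaction committed\n");
        } else {
            /* Safe even if BEGIN itself failed; at worst it is a no-op warning. */
            exec_or_fail(conn, "ROLLBACK");
            printf("transaction rolled back; data is unchanged\n");
        }

        PQfinish(conn);
        return 0;
    }

The same pattern scales to any number of statements: the application decides whether to retry or report the error, and the database takes care of undoing the partial work.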

Moral of the story: if you’re not building on top of a robust DBMS, and you have any significant amount of state to manage, then you are not getting error recovery right (unless you have put in the serious analysis required). Similar problems apply to those who use a DBMS as if it were a filesystem, with no constraints (to prevent corruption) and with related operations split across transaction boundaries. So: use PostgreSQL, use transactions, and use constraints.
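
And on the “use constraints” point: declare your invariants in the schema, so the database rejects corrupt data outright instead of trusting every code path in the application to maintain them. A tiny sketch, reusing the hypothetical exec_or_fail() helper and connection from the example above:

    /* Declare invariants in the schema; the table and columns are hypothetical.
     * Reuses exec_or_fail() and conn from the previous sketch. */
    exec_or_fail(conn,
                 "CREATE TABLE accounts ("
                 "  id      integer PRIMARY KEY,"
                 "  balance numeric NOT NULL CHECK (balance >= 0))");

    /* If an update would drive a balance negative, the CHECK constraint
     * rejects it, and the surrounding transaction can simply be rolled
     * back with the data left intact. */
    if (!exec_or_fail(conn, "UPDATE accounts SET balance = balance - 100 WHERE id = 1"))
        exec_or_fail(conn, "ROLLBACK");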

Linux OOM Killer (November 29, 2009)

The Linux OOM Killer heuristic can be summed up as:

  1. Run out of memory.
  2. Kill PostgreSQL.
  3. Look for processes that might be using too much memory, and kill them, hopefully freeing memory.

Notice that step #2 is independent of factors like:

  • Is PostgreSQL consuming a significant share of the memory resources?
  • Will killing PostgreSQL alleviate any memory pressure at all?

The reason is that Linux, when under memory pressure, invokes the badness() function to determine which process to kill. One of the things the function counts against a process is its total VM size (no surprise), plus half of the total VM size of each of its children. That sounds reasonable enough… but wait a minute, what about shared memory?

For every connection, PostgreSQL forks a child process that shares memory with the parent; on a big postgres instance, that shared memory can be substantial. But the badness function counts the same shared memory against the parent 1 + N/2 times, where N is the number of open connections! For example, if you have shared_buffers set to 1GB on an 8GB machine and 20 connections open, the Linux badness function thinks that PostgreSQL is using about 11GB of memory, when it is actually using about 1GB!
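
To make the arithmetic explicit, here is a tiny sketch of that overcounting; it is only an illustration of the 1 + N/2 factor described above, not the kernel’s actual badness() code, and it uses the numbers from the example.

    /* Illustration of the 1 + N/2 overcounting described above; this is
     * not the kernel's badness() code, just the arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double shared_gb = 1.0;  /* shared_buffers, visible in every process's VM size */
        int connections = 20;    /* each connection is a child of the postmaster */

        /* The postmaster is charged its own VM size plus half of each child's,
         * and the same shared segment appears in all of those VM sizes. */
        double charged_gb = shared_gb * (1.0 + connections / 2.0);

        printf("actual shared memory:  %.0f GB\n", shared_gb);   /* 1 GB  */
        printf("charged to postmaster: %.0f GB\n", charged_gb);  /* 11 GB */
        return 0;
    }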

It gets worse: killing a process does not free shared memory at all. And shared memory is already subject to administrator-controlled limits that start off fairly low (32MB, I think), so the administrator would have to make a mistake for shared memory to be the problem in the first place.

And it gets even worse: Linux makes very little attempt to avoid getting into a bad situation in the first place. On most operating systems, if you ask for too much memory, malloc() will eventually fail, which will probably cause the application to terminate; or, if the application is written to handle such conditions (like PostgreSQL), it can degrade gracefully. On Linux, by default, brk() (the system call underlying malloc()) will happily return success in almost every situation, and the first hint your application gets of a problem is a SIGKILL.
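
For reference, here is a minimal sketch of the well-behaved pattern: check malloc()’s return value and turn allocation failure into an error the caller can handle. Under strict accounting (or with overcommit disabled) this path actually runs; under the default Linux overcommit behavior, the allocation often “succeeds” anyway and the process may simply be killed later when the memory is touched. The function and sizes here are illustrative, not from any real codebase.

    /* Sketch: the only portable defense an application has is to check
     * malloc()'s return value and degrade gracefully when it is NULL. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Allocate a work buffer, or report failure so the caller can degrade
     * gracefully (abort the operation, roll back, shed load, and so on). */
    static void *get_work_buffer(size_t bytes)
    {
        void *buf = malloc(bytes);
        if (buf == NULL)
            fprintf(stderr, "out of memory: could not allocate %zu bytes\n", bytes);
        return buf;
    }

    int main(void)
    {
        /* 1GB, illustrative; assumes a 64-bit system. */
        void *buf = get_work_buffer((size_t)1024 * 1024 * 1024);
        if (buf == NULL)
            return 1;  /* degrade gracefully instead of dying mid-operation */

        /* ... do the work ... */
        free(buf);
        return 0;
    }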

I have heard many excuses for this appalling set of behaviors, but none of them are satisfactory to me. Here is one explanation. Notice the implicit assumption that the most important thing a user might be running is a text editor, which should simply autosave every once in a while to avoid problems. Keeping processes actually running is apparently not important to Linux. The next general philosophy in that email is that applications are stupid and will never get their error paths or rollback right; but PostgreSQL seems to do that just fine (it needs very little memory to roll back, and it tries to free memory beforehand to make problems less likely). And the author seems to think that the first hint of a system problem should be a SIGKILL based on some heuristic (which, in the case of Linux, is seriously flawed, as I pointed out above).

Among other justifications are:

  • “You can configure linux to prevent this problem.” Sounds like MySQL, where you have to declare your tables to be transaction-safe. Why not safe by default, and configure for performance? (A rough sketch of those knobs follows this list.)
  • “Modern systems overcommit, and there are always edge cases where you get into problems.” OK, so keep them as edge cases, and at least attempt to avoid problem situations. If some bizarre set of circumstances, or maybe a malicious process, manages to cause memory problems on your system, then so be it. But well-behaved processes should get some indication of trouble prior to SIGKILL. And the real problem processes should be identified with some accuracy and reasonable intelligence (i.e., recognizing that shared memory can’t be freed by the OOM Killer, and therefore shouldn’t be counted).
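
For completeness, here is roughly what that configuration looked like on kernels of that era; the exact paths and values are assumptions about your kernel version, so treat this as a sketch rather than a recipe. Setting vm.overcommit_memory to 2 turns on strict accounting, so that malloc() can actually fail, and writing -17 to /proc/<pid>/oom_adj exempts a process (say, the postmaster) from the OOM killer.

    /* Sketch, assuming a 2.6-era kernel that exposes these /proc files:
     *   /proc/sys/vm/overcommit_memory  (2 = strict accounting)
     *   /proc/<pid>/oom_adj             (-17 = exempt from the OOM killer) */
    #include <stdio.h>

    int main(void)
    {
        FILE *f;
        int mode = -1;

        /* Report the current overcommit policy (0 is the heuristic default). */
        f = fopen("/proc/sys/vm/overcommit_memory", "r");
        if (f != NULL) {
            if (fscanf(f, "%d", &mode) == 1)
                printf("vm.overcommit_memory = %d\n", mode);
            fclose(f);
        }

        /* Ask the kernel to leave this process alone (needs privilege). */
        f = fopen("/proc/self/oom_adj", "w");
        if (f != NULL) {
            fprintf(f, "-17\n");
            /* Privilege errors may only surface when the write is flushed. */
            if (fclose(f) == 0)
                printf("requested OOM-killer exemption for this process\n");
            else
                perror("writing /proc/self/oom_adj");
        }
        return 0;
    }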

I have been complaining about this insanity for years, on lkml and elsewhere.

It makes me happy that other free software operating systems are still under active development, as illustrated by the recent release of FreeBSD 8.0. I am not saying FreeBSD is better in every way than Linux, but I do believe that competition is important even in the free software world.

This post is obviously a rant, and an ill-informed one, at that. I am no expert on memory management, so let me know if I am way off-base. However, I have never heard reasonable justifications for these things in response to my emails to lkml, in personal discussions with kernel hackers or other knowledgeable Linux folk, or anywhere else. If you do have a reasonable explanation, I am more than willing to listen.
