The Linux OOM Killer heuristic can be summed up as:
- Run out of memory.
- Kill PostgreSQL.
- Look for processes that might be using too much memory, and kill them, hopefully freeing memory.
Notice that step #2 is independent of factors like:
- Is PostgreSQL consuming a significant share of the memory resources?
- Will killing PostgreSQL alleviate any memory pressure at all?
The reason for this is because linux, when under memory pressure, invokes the
badness() function to determine which process to kill. One of the things that the function counts against a process is the total VM size of the process (no surprise), plus half of the total VM size of all of the children. That sounds reasonable — wait a minute, what about shared memory?
For every connection, PostgreSQL makes a child process, which shares memory with the parent; and if it is a big postgres instance, that can be a significant amount of memory. But the badness function counts the same shared memory against the parent 1+N/2 times, where N is the number of connections open! For example, if you have shared_buffers set to 1GB on an 8GB machine, and have 20 connections open, then the linux badness function thinks that PostgreSQL is using about 11GB of memory, when it is actually using 1GB!
It gets worse: killing a process does not free shared memory. And there are already administrator-controlled limits that start off fairly low (32MB, I think), so the administrator would have to make a mistake in order for shared memory to be the problem.
And it gets even worse: linux makes very little attempt to avoid getting into a bad situation in the first place. On most operating systems, if you ask for too much memory,
malloc() will eventually fail, which will probably cause the application to terminate; or if the application is written to handle such conditions (like PostgreSQL), it will gracefully degrade. On linux, by default,
brk() (the system call underlying
malloc()) will happily return success in almost every situation, and the first hint that your application gets of a problem is a SIGKILL.
I have heard many excuses for this appalling set of behaviors, but none of them are satisfactory to me. Here is one explanation. Notice the implicit assumption is that the most important thing that a user may be running is a text editor, and it should just autosave every once in a while to avoid problems. Keeping processes actually running is apparently not important to Linux. The next general philosophy present in the email is that applications are stupid, and they will never get the error paths or rollback right, but PostgreSQL seems to do that just fine (it needs very little memory to roll back, and tries to free memory before that to make it less likely to run into a problem). And the author seems to think that the first hint of a system problem should be a SIGKILL based on some heuristic (which, in the case of linux, is seriously flawed, as I pointed out above).
Among other justifications are:
- “You can configure linux to prevent this problem.” Sounds like MySQL, where you have to declare your tables to be transaction-safe. Why not safe by default, and configure for performance?
- “Modern systems overcommit, and there are always edge cases where you get into problems.” OK, so keep them as edge cases, and at least attempt to avoid problem situations. If some bizarre set of circumstances, or maybe a malicious process, is able to cause memory problems on your system, then so be it. But well-behaved processes should get some indication of trouble prior to SIGKILL. And the real problem processes should be identified with some accuracy and reasonable intelligence (i.e. recognize that shared memory can’t be freed by the OOM Killer, and therefore shouldn’t be considered).
I have been complaining about this insanity for years:
- http://email@example.com/msg120596.html (2007)
- http://lkml.org/lkml/2007/2/9/275 (2007)
- http://lkml.org/lkml/2008/2/5/495 (2008)
It makes me happy that other free software operating systems are still under active development, as illustrated by the recent release of FreeBSD 8.0. I am not saying FreeBSD is better in every way than Linux, but I do believe that competition is important even in the free software world.
This post is obviously a rant, and an ill-informed one, at that. I am no expert on memory management, so let me know if I am way off-base. However, I have never heard reasonable justifications for these things in response to my emails to lkml, in personal discussions with kernel hackers or other knowledgeable Linux folk, or anywhere else. If you do have a reasonable explanation, I am more than willing to listen.