The Linux OOM Killer heuristic can be summed up as:
- Run out of memory.
- Kill PostgreSQL.
- Look for processes that might be using too much memory, and kill them, hopefully freeing memory.
Notice that step #2 is independent of factors like:
- Is PostgreSQL consuming a significant share of the memory resources?
- Will killing PostgreSQL alleviate any memory pressure at all?
The reason is that Linux, when under memory pressure, invokes the badness() function to determine which process to kill. One of the things that the function counts against a process is the total VM size of the process (no surprise), plus half of the total VM size of each of its children. That sounds reasonable. But wait a minute: what about shared memory?
For every connection, PostgreSQL forks a child process, which shares memory with the parent; and if it is a big Postgres instance, that can be a significant amount of memory. But the badness function counts the same shared memory against the parent 1 + N/2 times, where N is the number of connections open! For example, if you have shared_buffers set to 1GB on an 8GB machine and have 20 connections open, then the Linux badness function thinks that PostgreSQL is using about 11GB of memory, when it is actually using 1GB!
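To make the arithmetic concrete, here is a rough sketch of that overcounting (this is not the actual kernel code, just the same calculation using the numbers from the example):

    #include <stdio.h>

    int main(void)
    {
        long shared_mb   = 1024;   /* shared_buffers = 1GB, as in the example */
        long connections = 20;     /* each connection is a child sharing that 1GB */

        /* the parent's own mapping, plus half of each child's mapping */
        long apparent_mb = shared_mb + connections * (shared_mb / 2);

        printf("badness sees ~%ld MB; actual shared memory is %ld MB\n",
               apparent_mb, shared_mb);   /* ~11264 MB vs. 1024 MB */
        return 0;
    }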
It gets worse: killing a process does not free shared memory. And there are already administrator-controlled limits that start off fairly low (32MB, I think), so the administrator would have to make a mistake in order for shared memory to be the problem.
And it gets even worse: Linux makes very little attempt to avoid getting into a bad situation in the first place. On most operating systems, if you ask for too much memory, malloc() will eventually fail, which will probably cause the application to terminate; or if the application is written to handle such conditions (like PostgreSQL), it will gracefully degrade. On Linux, by default, brk() (the system call underlying malloc()) will happily return success in almost every situation, and the first hint that your application gets of a problem is a SIGKILL.
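A trivial illustration of that default behavior (assuming a 64-bit machine with the usual heuristic overcommit setting; details vary by kernel version):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Keep asking for 1GB chunks without touching them.  Under the
         * default overcommit policy these calls keep succeeding long
         * after the total exceeds RAM + swap, because only address
         * space is being handed out, not memory.  (The chunks are
         * deliberately leaked; the process exits right away.) */
        size_t chunk = (size_t)1024 * 1024 * 1024;
        size_t total_gb = 0;

        for (int i = 0; i < 64; i++) {
            if (malloc(chunk) == NULL)
                break;              /* strict accounting, or a 32-bit VA */
            total_gb++;
        }
        printf("got %zu GB of address space without using any memory\n", total_gb);

        /* Writing to all of it is what would actually consume memory; on
         * a machine with less than that, the first notification of
         * trouble would be a SIGKILL from the OOM killer, not a NULL
         * return from malloc(). */
        return 0;
    }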
I have heard many excuses for this appalling set of behaviors, but none of them are satisfactory to me. Here is one explanation. Notice that the implicit assumption is that the most important thing a user may be running is a text editor, and that it should just autosave every once in a while to avoid problems. Keeping processes actually running is apparently not important to Linux. The next general philosophy present in the email is that applications are stupid and will never get the error paths or rollback right; yet PostgreSQL seems to do that just fine (it needs very little memory to roll back, and it tries to free memory first to make a failure even less likely). And the author seems to think that the first hint of a system problem should be a SIGKILL based on some heuristic (which, in the case of Linux, is seriously flawed, as I pointed out above).
Among other justifications are:
- “You can configure Linux to prevent this problem.” Sounds like MySQL, where you have to declare your tables to be transaction-safe. Why not be safe by default, and let people configure for performance?
- “Modern systems overcommit, and there are always edge cases where you get into problems.” OK, so keep them as edge cases, and at least attempt to avoid problem situations. If some bizarre set of circumstances, or maybe a malicious process, is able to cause memory problems on your system, then so be it. But well-behaved processes should get some indication of trouble prior to SIGKILL. And the real problem processes should be identified with some accuracy and reasonable intelligence (i.e. recognize that shared memory can’t be freed by the OOM Killer, and therefore shouldn’t be considered).
I have been complaining about this insanity for years:
- http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg120596.html (2007)
- http://lkml.org/lkml/2007/2/9/275 (2007)
- http://lkml.org/lkml/2008/2/5/495 (2008)
It makes me happy that other free software operating systems are still under active development, as illustrated by the recent release of FreeBSD 8.0. I am not saying FreeBSD is better in every way than Linux, but I do believe that competition is important even in the free software world.
This post is obviously a rant, and an ill-informed one, at that. I am no expert on memory management, so let me know if I am way off-base. However, I have never heard reasonable justifications for these things in response to my emails to lkml, in personal discussions with kernel hackers or other knowledgeable Linux folk, or anywhere else. If you do have a reasonable explanation, I am more than willing to listen.
Jeff,
I am no memory management expert, but I wonder what response you would get if you posted this on LKML:
Subject: How can PostgreSQL avoid the OOM killer?
The Linux OOM Killer heuristic can be summed up as:
1. Run out of memory.
2. Kill PostgreSQL.
3. Look for processes that might be using too much memory, and kill them, hopefully freeing memory.
The reason PostgreSQL is first on the OOM hit list is that the OOM killer looks at the shared memory mapped by Pg plus half of the shared memory mapped by its kids (even though it is all the same chunk of memory).
How do we have to change PostgreSQL to avoid being killed (unfairly)?
Right now I have to assume that LKML doesn’t care, because I’ve tried.
My only hope is to embarrass them into caring. They are smart people, so as soon as they do care I know it will be improved, but right now, they don’t.
Sometimes we just have to step back and say: did operating systems always do this? And the answer, of course, is no. With all the improvements in hardware, why are we taking steps _backward_ in safety for some marginal performance benefit?
Huh? If this bothers you so much, why don’t you just *fix it*? If you complain on LKML but then show up with a patch that makes the OOM killer correctly account for shared memory, I bet they’d take it; then you’d not only have the problem solved, but you could say that you were the guy who fixed it.
Tried that: http://lkml.org/lkml/2008/2/5/495
For some reason I can’t find the root of that thread, but it’s an email from me with the patch attached (as the subject suggests). It’s a very simple patch.
You know, I reported this exact same issue with the OOM killer and Oracle.
Eight years ago.
http://lkml.indiana.edu/hypermail/linux/kernel/0103.2/0812.html
So glad to see Linux making progress.
I’m glad I’m not completely crazy. Does Oracle suffer from the shared memory overcounting as well?
Yes, the OOM killer has been a source of so much pain for so many people for so many years that the people who defend it only make themselves look foolish. If it’s any consolation, there is a way to turn it off. Just “echo 2 > /proc/sys/vm/overcommit_memory” and it will turn off memory overcommit; this will give you sane “return errors instead of overcommitting” behavior, and without overcommit the OOM killer will never kick in. The downside is that this limits the total memory used by all processes to swap plus a percentage of physical memory, where the default percentage is 50 but can be overridden by writing into /proc/sys/vm/overcommit_ratio. It’s really unfortunate that 2/90 aren’t the defaults, but at least they’re settable.
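To make that limit concrete, here is a rough sketch of the arithmetic for a hypothetical machine (the kernel reports the real number as CommitLimit in /proc/meminfo):

    #include <stdio.h>

    int main(void)
    {
        long ram_mb  = 8192;   /* 8GB physical RAM (hypothetical)   */
        long swap_mb = 2048;   /* 2GB swap (hypothetical)           */
        long ratio   = 50;     /* the default overcommit_ratio      */

        /* With overcommit_memory=2, total committed memory across all
         * processes is capped at swap plus ratio% of RAM. */
        long commit_limit_mb = swap_mb + ram_mb * ratio / 100;

        printf("CommitLimit would be %ld MB\n", commit_limit_mb);   /* 6144 MB */
        return 0;
    }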
Having run out of memory on both Linux and FreeBSD (though not for a while on either OS), my experience is that FreeBSD will do a lot of swapping, then be fairly kind to you about the whole issue. Linux is quite brutal though, “You know this process you had running? It’s gone”.
Andrea Arcangeli, who wrote much of this code in the first place, submitted a patch to fix the problem some time ago:
http://marc.info/?l=linux-mm&m=119977937126925
But it seems to have gone nowhere.
Know what’s really funny? Andrea himself knows the corner-cases where the OOM killer kills the wrong thing are so numerous that he suggests you want to turn it off on servers: http://kerneltrap.org/node/3148
oom_adj (adjust the oom-killer score) and oom_score (display the current oom-killer score) are the two /proc entries you’d likely want to look at.
I think applications can even tune their own score too. Maybe it’s time for Postgres to fix it.
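For what it’s worth, a minimal sketch of a process lowering its own score via /proc/self/oom_adj (note that writing a negative value needs CAP_SYS_RESOURCE; an unprivileged process can only make itself more attractive to the OOM killer):

    #include <stdio.h>

    /* Ask the OOM killer to leave this process alone; -17 is the
     * "never kill" value for oom_adj. */
    static int lower_oom_priority(void)
    {
        FILE *f = fopen("/proc/self/oom_adj", "w");
        if (f == NULL)
            return -1;          /* not Linux, or the file is not writable */
        fprintf(f, "-17\n");
        return fclose(f);       /* 0 on success */
    }

    int main(void)
    {
        if (lower_oom_priority() != 0)
            perror("oom_adj");
        return 0;
    }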
A platform specific hack should not be required to avoid being unfairly singled out for termination in most cases.
Perhaps a platform (or distributor) specific hack for making such termination impossible is *acceptable*…after all, any system with overcommit can OOM, so there may as well be a policy when it happens….
That falls into the category of “you can configure Linux to prevent this problem”. That’s true, but I just don’t buy that argument. You shouldn’t have to configure the operating system to not kill your processes. Properly written applications shouldn’t need to add Linux-specific code to ask the operating system not to kill them.
So presumably you do no other platform specific optimizations when you install postgres? You simply install FreeBSD with all default options, then install postgres and presto, production ready system. Impressive.
Imagine if everything required tuning just to get it to work correctly: processors, gcc, etc.
Is it really too much to ask that the default behavior be sane? Think for a minute about how much you rely on sane defaults. Not just in computing, but other products and everyday social interactions.
If I put gas in my car, it goes. It doesn’t randomly kill the engine on the freeway because I turned on the air conditioning. In fact, there’s very little that I need to know about the inner workings of a car in order to get the basic functionality (I may choose to learn more, but some do not). That’s called progress.
By telling us we need to learn about memory management in order to be a *user* of the OS, Linux is putting itself at the center of the universe. In reality, most people would not notice if you replaced their Linux kernel with FreeBSD.
Then the distributors can complain. But end users who install Postgres will be happier and less surprised, and the region of pain will shrink a little bit, down to the packages. Sub-ideal, but progress.
Just turn OOM off and fix your memory leaks. There is nothing else you can do, really. I run with hardware watchdogs to catch everything. If the system runs out of memory, then bad stuff will happen and they’ll restart everything to get back to a sane state. As a side note, here is a tool to list RAM usage by processes, taking shared memory into account:
http://www.pixelbeat.org/scripts/ps_mem.py
Get a Linux box and a FreeBSD box. Get PostgreSQL running on both with a couple GB of shared buffers, start a lot of connections. Then, start a process on each which slowly eats memory such that it will consume all memory after an hour or so.
We both know bad things happen on Linux. But tell me what horrible things happen on FreeBSD, because I have not witnessed them.
If FreeBSD handles it fine then that’s cool (albeit beside the point). PostgreSQL will either exit or be left in a detectably bad state, and it can be restarted.
There was a LWN article recently about someone working on the OOM killer to improve process selection, so things may get better soon.
I must say, however, that you’re being unduly hard on Linux. Every UNIX implementation of which I’m aware has an sbrk() system call which succeeds without regard to memory pressure and an OOM killer which behaves stupidly in one scenario or another. FreeBSD 8.0’s OOM killer, for example, assumes the memory footprint of a process for which it can’t immediately acquire PROC_LOCK is zero; in a particular system I’ve been working on, the runaway application has a highly contended PROC_LOCK, so the OOM killer basically always picks the largest of the remaining processes. In my case, that’s always /sbin/init. Yay, kernel panic.
The sbrk() and mmap() system calls (malloc() maps onto a combination of the two, these days) return success in almost every situation because, on modern UNIX implementations, they allocate address space, *not* memory. The implementations on which malloc() demonstrably *does* return NULL probably have RLIMIT_DATA set to some “reasonable” value (most Linux distros set it to infinity), but it is important to note that it’s pretty trivial to construct 1) a situation where RLIMIT_DATA causes malloc() to return NULL even though there’s plenty of remaining memory, and 2) a situation where RLIMIT_DATA (even a very low limit) hasn’t been hit but the system still runs out of memory and triggers the OOM killer.
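A minimal sketch of situation 1 (using RLIMIT_AS rather than RLIMIT_DATA, so that mmap()-backed allocations are covered too, which is what a large malloc() becomes on common implementations):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Cap this process's address space at 64MB. */
        struct rlimit rl = { 64 * 1024 * 1024, 64 * 1024 * 1024 };
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }

        /* The machine may have gigabytes free, but this request is over
         * the per-process cap, so malloc() reports failure the sane way. */
        void *p = malloc(256 * 1024 * 1024);    /* 256MB */
        printf("malloc: %s\n", p ? "succeeded" : "returned NULL");
        free(p);
        return 0;
    }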
I mentioned above that malloc() allocates address space. The way you allocate actual memory is, of course, by touching the address space to cause a page fault. A naive way to remove (part of) the OOM problem would be to have sbrk() and mmap() allocate both address space and memory, making the page faults unnecessary, but that solution is unacceptably wasteful in practice. First, the heap algorithms in a typical malloc() implementation (say, jemalloc from FreeBSD) impose a correlation between address and allocation size in order to keep virtual memory fragmentation under control. That is, 8-byte malloc()s are much more likely to be near other 8-byte malloc()s than 12K malloc()s, for example. Even on a 64-bit machine, you can’t just ignore VA fragmentation; there’s still a finite amount of it, and using it sparsely costs both memory and TLB entries. Second, even without also allocating memory, the sbrk() and mmap() system calls are expensive. So the reality is that a malloc() implementation will tend to spread different-sized allocations out in the address space because of the different size pools, but must request large (contiguous) chunks of address space from sbrk() and mmap() (address space it has no way of knowing whether it will actually use, since it can’t predict what malloc() calls will be made in the future) in order to amortize the cost. So maybe you can hope for some kind of SIGBACONATOR that puts you on notice that you’d better go on a diet or face the OOM killer, but malloc() returning NULL is basically out of the question; it can’t determine in advance whether there’s enough free memory because communication with the OS is far too costly, and the page fault can’t affect the return value because it happens *after* malloc() returns!
Along those lines: you mentioned that applications written to handle OOM conditions are able to gracefully degrade. That isn’t actually true in practice. Or at least, the likelihood that graceful degradation in OOM conditions is impossible is sufficiently high that the code to implement it is almost never worth the effort, unless you are exercising complete control over a system (to the extent that you’re willing to do non-portable things like setting oom_adj and oom_score). The reason is that, while an application can certainly avoid calling malloc() while trying to recover, it is much, much more difficult to avoid page faulting because it needs another page in some data buffer it has already allocated, or another page of stack (presumably, performing the recovery requires calling one or more functions, and signal handling requires stack space too, of course). Or because it needs another page of code: after all, the no-more-memory code path by definition won’t have been executed before, so it probably isn’t even resident and will have to be paged in from disk. If you’re using C++, maybe the tables used by the compiler to unwind the stack when throwing an exception aren’t resident. In either of those situations, any application that needs memory for any reason deadlocks unless the OOM killer runs to free some physical memory and/or swap space.
Thank you for the informative reply.
1. “the likelihood that graceful degradation in OOM conditions is impossible is sufficiently high that the code to implement it is almost never worth the effort”
PostgreSQL gracefully degrades quite well. ROLLBACK requires very little memory even if a lot happened during the transactions. Before it does anything else, it frees other memory, making it very likely to succeed.
In theory maybe it needs an extra page on the stack that has never been touched before, and that could cause a problem. Or maybe that can happen with the heap in some bizarre situations. But generally, ROLLBACK doesn’t need to touch any new pages that haven’t been touched before (I haven’t proven this, of course, but it would certainly be a bizarre situation if it were not true). It gracefully degrades.
2. “no-more-memory code path by definition won’t have been executed before”
The ROLLBACK code almost certainly would be paged in, and PostgreSQL is written in C. You’re approaching this from a very theoretical standpoint, but these don’t seem like practical problems in the case of PostgreSQL. It has had problems on this path before (and those were addressed by freeing memory carefully first), but for the most part it degrades nicely.
3. You didn’t address the shared memory counting. Killing a process doesn’t free shared memory, and yet they count it anyway (and ignored my patch to subtract the shared memory before counting up the VMs).
4. I see what you’re saying about brk()/sbrk() allocating VM space, not memory. Even if it’s a crude tool, can’t it be used to guess whether the system is potentially in trouble? Processes that allocate a lot of VM probably plan to use a significant amount of it.
Or maybe it’s possible to have a cheap syscall that says “I already have this page in my VM. Can you allocate the memory now?” and it could fail, and malloc could use that syscall rather than waiting for the application to cause a trap.
After thinking about VM space versus allocated memory, I believe this is solvable. Malloc knows how much the application requested, and just needs to tell the OS via some syscall within some margin of error. Even if the syscall is slow, surely it could report every 1024 pages or so.
If some application bypasses malloc and doesn’t inform the OS what it’s doing, it deserves what it gets.
Actually, malloc has no idea how much memory within the heap is in use and won’t page fault: you’re ignoring fork(). In any case the kernel already knows exactly how much memory the application is using within the margin of error maintained by the malloc() implementation, so you don’t need to tell it with a syscall. It just doesn’t have much of an opportunity to act on that information until there’s already a pending page fault it can’t service without killing something first.
Also, don’t forget things like dlopen(), which makes a large shared mapping and then sparsely copy-on-write faults individual pages as it performs relocations: the address space based guess you proposed in (4) would trip up on that, too.
“kernel already knows exactly how much memory the application is using”
Knows how much the application is using, or knows how much malloc has told the application that it can use?
I thought the problem was VM space versus actually allocated memory.
If, in addition to allocating VM space, malloc explicitly asked the OS to reserve memory (say, every 1024 pages or so to amortize the cost of a syscall) before returning a valid pointer, then the OS would have an opportunity to say “there’s no way I can reserve those pages for you”, and malloc could return NULL.
I understand there are still COW issues with fork and dlopen, but I’m not looking for a perfect solution. In those cases the OS has a pretty good idea how much it has extended itself (number of processes sharing that page that might try to write).
The main problem is with malloc, where (as you say) there is such a huge disconnect between the VM size and the memory that the application has requested that the OS doesn’t know what kind of trouble it’s in early enough; and even if it does, malloc has no way of knowing that the OS may not be able to satisfy the requests.
Sorry, some sloppy thinking on my part. What I said is only true if the application is free()ing and never malloc()ing, and the kernel is off by whatever amortization margin the malloc() implementation uses (3MB for the typical FreeBSD 8.0 configuration, as I mentioned earlier).
Your syscall idea is like an mlock() that reserves either or both of RAM and swap instead of only RAM, or like a MAP_RESERVE flag for Linux’s mmap() to complement the MAP_NORESERVE flag. I think there’s some merit to that suggestion.
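For reference, a minimal sketch of the existing MAP_NORESERVE side of that coin, on a typical 64-bit Linux box with the default heuristic overcommit setting (exact behavior varies with the overcommit mode and kernel version):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t huge = (size_t)1 << 40;      /* 1TB: an "obvious" overcommit */

        /* Even the heuristic policy normally refuses a single accounted
         * mapping this far beyond RAM + swap... */
        void *plain = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ...but MAP_NORESERVE asks the kernel to skip the accounting
         * entirely, roughly the inverse of a MAP_RESERVE flag. */
        void *lazy = mmap(NULL, huge, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

        printf("plain mapping:     %s\n", plain == MAP_FAILED ? "refused" : "ok");
        printf("noreserve mapping: %s\n", lazy  == MAP_FAILED ? "refused" : "ok");
        return 0;
    }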
The only other thing I’d note is that heap implementations typically give each thread its own arena so as to minimize lock contention on SMP systems. Each of them would need an independent amortized syscall; what I’m getting at is that if this were a workable solution in general, /proc/sys/vm/overcommit_memory==2 would be the default.
Of course, it’s probably a perfectly workable default for people running Postgres.
In response to (1): I’m sure Postgres does degrade quite well on a system with overcommit disabled, but you’re greatly underestimating how difficult it is to degrade gracefully in an OOM condition when memory is overcommitted. Consider that, by the time an OOM condition is reached, the paging daemon has been awake long enough to give every single memory access attempted by Postgres a non-zero risk of page faulting: code, data, heap, stack, anything. Postgres might squeak by because the memory it needs to touch in order to ROLLBACK is probably last in the pager’s LRU queue, but it does so on dumb luck, with a probability that has nothing to do with its own code’s carefulness and everything to do with what else is happening in the system. The “bizarre situation” is actually the common case, because once the paging daemon awakens, having touched a page in the past gives you little protection.
By the way, calling free() doesn’t actually free memory just as calling malloc() doesn’t actually allocate it. To actually free physical pages, a malloc() implementation must unmap the address space to which they are mapped using sbrk() or munmap(), or must madvise(MADV_FREE) it. Both of these operations are quite expensive, so malloc() implementations amortize the cost. The default configuration of FreeBSD’s jemalloc, for example, unmaps in 1M (contiguous) chunks and lets 2M (discontiguous) accumulate before calling madvise().
In response to (2) I’ll stress again that degrading gracefully after hitting an RLIMIT_* is an entirely different beast than degrading gracefully after the paging daemon has injected non-determinism into your application. And of course it’s worse than that; you can’t even necessarily perform I/O in an OOM condition: the data structures that the syscalls build and submit to the block layer and I/O scheduler and so on obviously must be allocated first.
(I should concede, here, that on a system without swap your arguments are on much stronger ground.)
I don’t address the shared memory counting (3) because I agree with you. Hopefully the work discussed in the LWN article I mentioned, http://lwn.net/Articles/359998/, will go somewhere.
With respect to (4): Yeah, you can make a guess based on allocated VM space, but it would give you a false positive more often than you think. Say you have a system with only 3M or so free and you’re sitting at a shell. Suppose further that you happen to have a pre-compiled “Hello, World” application sitting around. Chances are you wouldn’t be able to run it: the first thing it’s going to do when you run it is mmap() /lib/libc.so, which is probably 2M or so; then it’ll try to call printf(). Even if we ignore stdio’s internal buffers, printf() typically calls malloc() internally while building its string, and the first thing malloc() is going to do is sbrk()/mmap() at least 1M of address space because it’s trying to amortize that cost over lots of malloc()s, and doesn’t realize the program is about to terminate. Boom, allocated VM space exceeds available memory; never mind the fact that such a process probably has a peak unshared memory footprint of like 16K.
No system call is cheap on the time scale relevant to a good general purpose malloc implementation, unfortunately.
You might want to take a look at the Android modifications to the OOM killer.
http://lwn.net/Articles/317814/
http://lwn.net/Articles/104185/ contains my favourite explanation of OOM.
I recently turned on strict accounting on a client’s machine to try to minimise the OOM danger. Sadly, while Postgres might be well behaved, other apps are not, and the system very quickly went belly up. Saying “fix your leaky app” doesn’t help much when the app is PHP. I had to turn it off again and hope like hell we don’t have a nasty incident.
“[...]From Jeff Davis’s Experimental Thoughts comes this post on Postgres and the Linux OOM Killer. … Yes, that would make any Postgres DBA feel kind of ranty.”
Log Buffer #171
Like you, I am completely dumbfounded with the “solution” to this problem.
I am pro-Linux, but that any of the kernel developers defend this OOM mechanism is an embarrassment. It’s so obviously wrong to arbitrarily kill processes by any heuristic. Even if the heuristic were improved, there’s no reason (ever) to be killing off applications which were behaving correctly. Sounds like that was designed by a banker, doesn’t it?
Process badness?? Who are we kidding here, this is pure kernel badness. This is a glaring flaw on an otherwise reliable platform.
The solution should address the real problems:
1. The system should never over-commit. If there is not enough memory to fulfill a new memory request, then fail *deterministically*, where applications can deal with it.
2. If this prevents a process from forking, then so be it. At least the system remains stable, and the failed request can be handled gracefully.
3. Maybe the fork/exec combination is fundamentally flawed. Even when it works it’s not particularly efficient, this could be an opportunity to improve the system calls.
4. Some feel that over-commit is desirable. I personally disagree that this is acceptable in any production setting, but if it’s necessary then it should be explicit. The kernel must always honor the memory contracts between itself and processes, no exceptions. If an application wants/needs to overcommit RAM it probably won’t use, then let it do so explicitly and at its own risk.
Conceptually, a modified fork call could keep the parent safe from overcommit and allow its child to overcommit until it runs exec; the new executable could do whatever it pleases with regard to overcommit, but by default all processes have the right to expect that the kernel will honor its obligations.
I simply cannot believe my ears when i hear Linux kernel hackers debating OOM process killer selection heuristics rather than how to actually fix the problem.
The OOM killer is also quite annoying if you rely on garbage collection. The problem for GC is that most GC implementations will defer garbage collection until allocating new memory/VM no longer works, at which point OOM is assumed, garbage is collected, and the allocation is retried. Garbage collection is a very good counterexample of a case where freeing memory in an OOM situation is well defined and doesn’t run into the rollback trouble mentioned above.
Since malloc and sbrk are allocating VM, I think there should be a mechanism that kindly signals processes to free memory before getting into a severe OOM situation. Not as a reaction to acute memory demands, but a kernel process that constantly monitors the system and every now and then nudges the processes.
A good criterion for that would be the ratio of a process’s allocated pages to pages recently accessed. For garbage-collected, but also for leaking, processes this number may grow rather quickly. If a process actually freed up memory in that situation, the kernel could give it bonus points so as not to kill it on OOM. It could also track statistics about the average number of additionally mapped pages during memory deallocation; the fewer, the better.
Random idea:
At some threshold before real OOM happens (say, 90% of swap+physical is used), probabilistically refuse brk() calls (with the probability of refusal increasing as memory use nears 100%).
This would let the dumb apps crash themselves, and let Postgres recover cleanly.
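A toy sketch of that policy, purely illustrative (this is not kernel code; a real implementation would live in the allocation path):

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdlib.h>

    /* Below the threshold, never refuse; above it, refuse with a
     * probability that ramps linearly from 0 at 90% usage to 1 at 100%. */
    static bool should_refuse(double used_fraction)
    {
        const double threshold = 0.90;

        if (used_fraction < threshold)
            return false;

        double p = (used_fraction - threshold) / (1.0 - threshold);
        return (double)rand() / RAND_MAX < p;
    }

    int main(void)
    {
        for (int pct = 85; pct <= 100; pct += 5)
            printf("at %d%% used, refused this time: %s\n",
                   pct, should_refuse(pct / 100.0) ? "yes" : "no");
        return 0;
    }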