Comments on: Linux OOM Killer
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/
Ideas on Databases, Logic, and Language by Jeff Davis

By: Pierre (Thu, 23 Dec 2010 22:52:57 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-1423

Random idea:

At some threshold before real OOM happens (say, 90% of swap+physical is used), probabilistically refuse brk() calls (with the probability of refusal increasing as memory use nears 100%).

This would let the dumb apps crash themselves, and let postgres recover cleanly.
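
A minimal sketch of what that refusal curve might look like, purely hypothetical (no such policy or knob exists in stock Linux); should_refuse_growth() is an invented name standing in for whatever would gate brk()/mmap() growth requests:

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical policy: always grant below a 90% utilization threshold;
     * above it, refuse with a probability that ramps linearly to 1.0 at 100%. */
    static bool should_refuse_growth(double used_bytes, double total_bytes)
    {
        const double threshold = 0.90;
        double utilization = used_bytes / total_bytes;

        if (utilization < threshold)
            return false;

        double refuse_prob = (utilization - threshold) / (1.0 - threshold);
        return ((double)rand() / RAND_MAX) < refuse_prob;
    }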

By: Wolfgang Draxinger (Sun, 04 Jul 2010 01:17:29 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-233

The OOM killer is also quite annoying if you rely on garbage collection. The problem for GC is that most GC implementations defer collection until allocating new memory/VM fails, at which point OOM is assumed, garbage is collected, and the allocation is retried. Garbage collection is a very good counterexample of a case where freeing memory in an OOM situation is well defined and doesn’t run into the rollback trouble mentioned above.
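
A rough sketch of that defer-then-collect pattern; gc_collect() and raw_alloc() are invented stand-ins for whatever a real runtime uses internally:

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical hooks into a GC runtime; names are illustrative only. */
    extern void gc_collect(void);          /* run a full collection        */
    extern void *raw_alloc(size_t size);   /* grow the heap via malloc/brk */

    /* Typical deferred-collection allocator: only collect once the
     * underlying allocation fails, then retry exactly once. */
    void *gc_alloc(size_t size)
    {
        void *p = raw_alloc(size);
        if (p != NULL)
            return p;

        /* Allocation failed: assume memory pressure, collect, retry. */
        gc_collect();
        p = raw_alloc(size);

        if (p == NULL)
            abort();    /* genuinely out of memory */
        return p;
    }

The catch under Linux overcommit is that raw_alloc() rarely returns NULL at all; the failure shows up later as a page fault the kernel cannot service, so the collector never gets its cue.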

Since malloc and sbrk allocate VM, I think there should be a mechanism that kindly signals processes to free memory before the system gets into a severe OOM situation: not as a reaction to acute memory demand, but a kernel process that constantly monitors the system and every now and then nudges processes.

A good criterion for that would be the ratio of a process’s allocated pages to its recently accessed pages. For garbage-collected (but also leaking) processes this number may grow rather quickly. For a process that actually frees up memory in that situation, the kernel could award bonus points so that it is not killed on OOM. It could also track statistics on the average number of additional pages mapped while the process is deallocating memory: the fewer, the better.
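
Sketching the proposed bookkeeping (every name below is invented; the stock OOM killer scores on resident size and oom_score_adj, not on anything like this):

    /* Hypothetical per-process accounting; field names are made up. */
    struct proc_mem_stats {
        unsigned long allocated_pages;    /* pages mapped into the process */
        unsigned long recently_accessed;  /* pages touched in the last scan */
        unsigned long freed_after_nudge;  /* pages returned after a nudge   */
    };

    /* Higher score = better candidate for a "please free memory" nudge;
     * processes that responded to earlier nudges earn a protective bonus. */
    static double nudge_score(const struct proc_mem_stats *s)
    {
        double idle_ratio = (double)s->allocated_pages /
                            (double)(s->recently_accessed + 1);
        double bonus = (double)s->freed_after_nudge / 1024.0;
        return idle_ratio - bonus;
    }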

By: Lou Gosselin (Fri, 26 Feb 2010 02:43:23 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-177

Like you, I am completely dumbfounded by the “solution” to this problem.

I am pro-Linux, but that any of the kernel developers defend this OOM mechanism is an embarrassment. It’s so obviously wrong to arbitrarily kill processes by any heuristic. Even if the heuristic were improved, there’s no reason (ever) to be killing off applications which were behaving correctly. Sounds like that was designed by a banker, doesn’t it?

Process badness?? Who are we kidding here, this is pure kernel badness. This is a glaring flaw on an otherwise reliable platform.

The solution should address the real problems:
1. The system should never over-commit. If there is not enough memory to fulfill a new memory request, then fail *deterministically*, where applications can deal with it.

2. If this prevents a process from forking, then so be it. At least the system remains stable, and the failed request can be handled gracefully.

3. Maybe the fork/exec combination is fundamentally flawed. Even when it works it’s not particularly efficient; this could be an opportunity to improve the system calls.

4. Some feel that over-commit is desirable. I personally disagree that this is acceptable in any production setting, but if it’s necessary then it should be explicit. The kernel must always honor the memory contracts between itself and processes, no exceptions. If an application wants/needs to overcommit RAM it probably won’t use, then let it do so explicitly and at its own risk.

Conceptually, a modified fork call could keep the parent safe from overcommit and allow its child to overcommit until it runs exec; the new executable could do whatever it pleases with regard to overcommit, but by default all processes have the right to expect that the kernel will honor its obligations.
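
For what it's worth, Linux already has a per-mapping knob in this direction: MAP_NORESERVE asks for address space without reserving backing for it, which is roughly the explicit, at-your-own-risk opt-in described above. A minimal sketch; whether the non-NORESERVE mapping is actually refused up front still depends on the system-wide overcommit policy (vm.overcommit_memory), so treat this as illustrative rather than a guarantee:

    #include <stdio.h>
    #include <sys/mman.h>

    #define GIB (1024UL * 1024UL * 1024UL)

    int main(void)
    {
        /* Explicitly opt this one region into overcommit: ask for address
         * space without asking the kernel to commit swap/RAM up front.
         * Writes into it can still fail later if memory really runs out. */
        void *lazy = mmap(NULL, 8 * GIB, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (lazy == MAP_FAILED) {
            perror("mmap (noreserve)");
            return 1;
        }

        /* A mapping without MAP_NORESERVE is the "honor the contract" case:
         * the kernel should either account for it fully or refuse it here. */
        void *strict = mmap(NULL, 1 * GIB, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (strict == MAP_FAILED)
            perror("mmap (strict)");

        munmap(lazy, 8 * GIB);
        if (strict != MAP_FAILED)
            munmap(strict, 1 * GIB);
        return 0;
    }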

I simply cannot believe my ears when I hear Linux kernel hackers debating OOM process killer selection heuristics rather than how to actually fix the problem.

By: Evan P (Tue, 08 Dec 2009 09:15:20 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-164

Sorry, some sloppy thinking on my part. What I said is only true if the application is free()ing and never malloc()ing, and the kernel is off by whatever amortization margin the malloc() implementation uses (3MB for the typical FreeBSD 8.0 configuration, as I mentioned earlier).

Your syscall idea is like an mlock() that reserves either or both of RAM and swap instead of only RAM, or like a MAP_RESERVE flag for Linux’s mmap() to complement the MAP_NORESERVE flag. I think there’s some merit to that suggestion.
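
The closest thing that exists today is probably mlock(2), which fails with ENOMEM up front if the kernel cannot keep the range resident in RAM; the hypothetical call would behave the same way but allow the reservation to be satisfied from swap as well. A small sketch of the existing behavior (unprivileged processes are limited by RLIMIT_MEMLOCK, so a request this size may be refused for that reason instead):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64 * 1024 * 1024;  /* 64 MiB */

        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* mlock() only succeeds if the kernel can actually keep these pages
         * resident in RAM; this is the "fail up front" behavior, restricted
         * today to RAM and subject to RLIMIT_MEMLOCK. */
        if (mlock(buf, len) != 0) {
            perror("mlock");
            munmap(buf, len);
            return 1;
        }

        memset(buf, 0, len);   /* guaranteed not to OOM-fault now */
        munlock(buf, len);
        munmap(buf, len);
        return 0;
    }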

The only other thing I’d note is that heap implementations typically give each thread its own arena so as to minimize lock contention on SMP systems. Each of them would need an independent amortized syscall; what I’m getting at is that if this were a workable solution in general, /proc/sys/vm/overcommit_memory==2 would be the default.

Of course, it’s probably a perfectly workable default for people running Postgres ;-)

By: Pádraig Brady (Mon, 07 Dec 2009 14:11:10 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-162

If FreeBSD handles it fine then that’s cool (albeit beside the point). PostgreSQL will exit or be in a detectable bad state, and can be restarted.

By: Log Buffer (Fri, 04 Dec 2009 20:25:06 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-161

“[...] From Jeff Davis’s Experimental Thoughts comes this post on Postgres and the Linux OOM Killer. … Yes, that would make any Postgres DBA feel kind of ranty.”

Log Buffer #171: http://www.pythian.com/news/6165/log-buffer-171-a-carnival-of-the-vanities-for-dbas/

By: Jeff Davis (Wed, 02 Dec 2009 19:09:49 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-160

“kernel already knows exactly how much memory the application is using”

Knows how much the application is using, or knows how much malloc has told the application that it can use?

I thought the problem was VM space versus actually allocated memory.

If, in addition to allocating VM space, malloc explicitly asked the OS to reserve memory (say, every 1024 pages or so to amortize the cost of a syscall) before returning a valid pointer, then the OS would have an opportunity to say “there’s no way I can reserve those pages for you”, and malloc could return NULL.
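
A sketch of what the allocator side of that might look like; reserve_pages() is a hypothetical syscall (nothing like it exists in Linux today), and a real implementation would live inside malloc itself rather than in a wrapper around it:

    #include <errno.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical syscall: ask the kernel to commit 'n' pages to this
     * process, failing with ENOMEM instead of overcommitting. */
    extern int reserve_pages(size_t n);

    #define RESERVE_BATCH 1024   /* amortize: one syscall per 1024 pages */

    static size_t reserved_bytes;    /* commitment we already hold        */
    static size_t handed_out_bytes;  /* memory already returned to callers */

    void *reserving_malloc(size_t size)
    {
        long page = sysconf(_SC_PAGESIZE);

        /* Top up the reservation in large batches so the syscall cost is
         * paid once per RESERVE_BATCH pages, not once per malloc(). */
        while (handed_out_bytes + size > reserved_bytes) {
            if (reserve_pages(RESERVE_BATCH) != 0)
                return NULL;            /* kernel said no: report ENOMEM */
            reserved_bytes += (size_t)RESERVE_BATCH * (size_t)page;
        }

        void *p = malloc(size);
        if (p != NULL)
            handed_out_bytes += size;
        else
            errno = ENOMEM;
        return p;
    }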

I understand there are still COW issues with fork and dlopen, but I’m not looking for a perfect solution. In those cases the OS has a pretty good idea how much it has extended itself (number of processes sharing that page that might try to write).

The main problem is with malloc, where (as you say) there is such a huge disconnect between the VM size and the memory that the application has requested that the OS doesn’t know what kind of trouble it’s in early enough, and even if it does, malloc has no way of knowing that the OS may not be able to satisfy the requests.

By: Dan Farina (Wed, 02 Dec 2009 18:08:12 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-159

Then the distributors can complain :) But end users that install Postgres will be happier and less surprised and the region of pain will shrink a little bit to packages. Sub-ideal, but progress.

By: Evan P (Wed, 02 Dec 2009 08:19:33 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-158

Actually, malloc has no idea how much memory within the heap is in use and won’t page fault: you’re ignoring fork(). In any case the kernel already knows exactly how much memory the application is using within the margin of error maintained by the malloc() implementation, so you don’t need to tell it with a syscall. It just doesn’t have much of an opportunity to act on that information until there’s already a pending page fault it can’t service without killing something first.

Also, don’t forget things like dlopen(), which makes a large shared mapping and then sparsely copy-on-write faults individual pages as it performs relocations: the address-space-based guess you proposed in (4) would trip up on that, too.
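
That disconnect is easy to see on Linux by comparing VmSize (mapped address space) with VmRSS (resident pages) in /proc/self/status before and after a large mapping; the same pattern shows up after dlopen() of a big shared object. A small illustration:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Print the VmSize/VmRSS lines from /proc/self/status (Linux-specific). */
    static void show_mem(const char *label)
    {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (f == NULL)
            return;
        printf("-- %s --\n", label);
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "VmSize:", 7) == 0 || strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        fclose(f);
    }

    int main(void)
    {
        show_mem("at start");

        size_t len = 512UL * 1024 * 1024;   /* 512 MiB of address space */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        show_mem("after mmap (VmSize up, VmRSS unchanged)");

        memset(p, 1, len);                  /* fault every page in */
        show_mem("after touching it (VmRSS catches up)");

        munmap(p, len);
        return 0;
    }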

By: Evan P (Wed, 02 Dec 2009 08:01:09 +0000)
http://thoughts.davisjeff.com/2009/11/29/linux-oom-killer/comment-page-1/#comment-157
In response to (1): I’m sure Postgres does degrade quite well on a system with overcommit disabled, but you’re greatly underestimating how difficult it is to degrade gracefully in an OOM condition when memory is overcommitted. Consider that, by the time an OOM condition is reached, the paging daemon has been awake long enough to give every single memory access attempted by Postgres a non-zero risk of page faulting: code, data, heap, stack, anything. Postgres might squeak by because the memory it needs to touch in order to ROLLBACK is probably last in the pager’s LRU queue, but it does so on dumb luck, with a probability that has nothing to do with its own code’s carefulness and everything to do with what else is happening in the system. The “bizarre situation” is actually the common case, because once the paging daemon awakens, having touched a page in the past gives you little protection.

By the way, calling free() doesn’t actually free memory, just as calling malloc() doesn’t actually allocate it. To actually free physical pages, a malloc() implementation must unmap the address space to which they are mapped using sbrk() or munmap(), or must madvise(MADV_FREE) it. Both of these operations are quite expensive, so malloc() implementations amortize the cost. The default configuration of FreeBSD’s jemalloc, for example, unmaps in 1M (contiguous) chunks and lets 2M (discontiguous) accumulate before calling madvise().
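
A small sketch of that give-pages-back path: MADV_FREE is the FreeBSD flavor being described here (Linux gained its own MADV_FREE only much later), so the sketch falls back to the older Linux MADV_DONTNEED, which reclaims eagerly rather than lazily:

    #include <string.h>
    #include <sys/mman.h>

    /* Tell the kernel it may reclaim the physical pages behind [p, p+len)
     * while keeping the address range mapped; a later write simply faults
     * fresh pages back in. Allocators batch this because it is expensive. */
    static void release_pages(void *p, size_t len)
    {
    #ifdef MADV_FREE
        madvise(p, len, MADV_FREE);      /* lazy reclaim (FreeBSD, newer Linux) */
    #else
        madvise(p, len, MADV_DONTNEED);  /* eager reclaim (older Linux)         */
    #endif
    }

    int main(void)
    {
        size_t len = 2UL * 1024 * 1024;  /* a 2M chunk, as in the example above */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        memset(p, 1, len);       /* fault the pages in, so there is RSS to drop */
        release_pages(p, len);   /* hand the physical pages back to the kernel  */

        munmap(p, len);
        return 0;
    }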

In response to (2), I’ll stress again that degrading gracefully after hitting an RLIMIT_* is an entirely different beast than degrading gracefully after the paging daemon has injected non-determinism into your application. And of course it’s worse than that; you can’t even necessarily perform I/O in an OOM condition: the data structures that the syscalls build and submit to the block layer and I/O scheduler and so on obviously must be allocated first.

(I should concede, here, that on a system without swap your arguments are on much stronger ground.)

I don’t address the shared memory counting (3) because I agree with you. Hopefully the work discussed in the LWN article I mentioned, http://lwn.net/Articles/359998/, will go somewhere.

With respect to (4): Yeah, you can make a guess based on allocated VM space, but it would give you a false positive more often than you think. Say you have a system with only 3M or so free and you’re sitting at a shell. Suppose further that you happen to have a pre-compiled “Hello, World” application sitting around. Chances are you wouldn’t be able to run it: the first thing it’s going to do when you run it is mmap() /lib/libc.so, which is probably 2M or so; then it’ll try to call printf(). Even if we ignore stdio’s internal buffers, printf() typically calls malloc() internally while building its string, and the first thing malloc() is going to do is sbrk()/mmap() at least 1M of address space, because it’s trying to amortize that cost over lots of malloc()s and doesn’t realize the program is about to terminate. Boom, allocated VM space exceeds available memory, never mind the fact that such a process probably has a peak unshared memory footprint of like 16K.

No system call is cheap on the time scale relevant to a good general purpose malloc implementation, unfortunately.
