About the Linux OOM killer

1. What is the OOM killer

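The Linux kernel has a mechanism called the OOM killer (Out-Of-Memory killer). When the system is about to run out of memory, the kernel picks a process, typically one that occupies a large amount of memory or has consumed a lot of memory very quickly, and kills it so that memory is not exhausted. A typical symptom: one day a machine can no longer be reached over ssh but still answers ping, so the network is not at fault; the cause is that the sshd process was killed by the OOM killer. After rebooting the machine, the system log /var/log/messages contains an error message similar to "Out of memory: Kill process 1865 (sshd)".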
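
To check whether the OOM killer has been active, the system log can be scanned for such messages. The following is a minimal sketch, assuming the log lines follow the "Out of memory: Kill process <pid> (<name>)" pattern quoted above. The function name find_oom_kills and the sample lines are hypothetical; the pattern also accepts the "Killed process" wording that newer kernels print.

```python
import re

# Matches e.g. "Out of memory: Kill process 1865 (sshd)". The optional "ed"
# also accepts the "Killed process" wording; matching ignores case.
OOM_PATTERN = re.compile(
    r"out of memory: kill(?:ed)? process (\d+) \(([^)]*)\)", re.IGNORECASE
)


def find_oom_kills(lines):
    """Return a list of (pid, command) tuples, one per OOM-kill log line."""
    kills = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


# Hypothetical sample lines; on a real system, pass an open log file instead,
# e.g. open("/var/log/messages", errors="replace").
sample = [
    "kernel: Out of memory: Kill process 1865 (sshd) score 1 or sacrifice child",
    "sshd[2001]: Accepted password for root",
]
print(find_oom_kills(sample))  # [(1865, 'sshd')]
```

The log path depends on the distribution, so /var/log/messages should be treated as an assumption carried over from the example above.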

2. Why the OOM killer is needed

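The Linux memory management subsystem has an overcommit mechanism: processes may request more memory than is currently free on the system. It is configured through the proc interface /proc/sys/vm/overcommit_memory.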

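The default value is 0, a heuristic policy: obviously excessive requests are refused, overcommit is otherwise allowed so as to reduce swap usage, and root may allocate slightly more memory than ordinary users;
1 means overcommit is always allowed, which generally suits scientific computing programs;
2 means overcommit is not allowed: the total memory committed on the system cannot exceed CommitLimit, and in this case the allocation request of a process returns an error instead of a process being killed.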
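
The three modes can be looked up programmatically. This is a small sketch, not an official API: the names OVERCOMMIT_MODES, describe_overcommit_mode and current_overcommit_mode are made up for illustration, and the descriptions simply restate the list above.

```python
from pathlib import Path

# Descriptions restate the three documented modes.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "never overcommit; allocations beyond CommitLimit fail",
}


def describe_overcommit_mode(raw):
    """Map the text read from overcommit_memory to a short description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, "unknown mode %d" % mode)


def current_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the live setting; requires a Linux system with /proc mounted."""
    return describe_overcommit_mode(Path(path).read_text())
```

On a Linux machine, calling current_overcommit_mode() reports the active policy; changing it means writing to the same file as root.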

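Linux is designed this way because memory requested by a process is usually not used immediately, and over its whole lifetime a process rarely uses all the memory it requested. Without overcommit the system could not make full use of its memory, and memory would be wasted. Overcommit lets the system use memory more efficiently, but it also brings a risk: OOM. A memory-hogging program can exhaust the memory of the whole system and bring it to a halt, to the point where user programs cannot allocate even a single page. This is where the OOM killer comes in: it identifies a process that can be sacrificed for the sake of the whole system, kills it, and frees some memory.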

3. Controlling the OOM killer

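Which processes the OOM killer may kill and which it should spare is a genuinely hard problem, so the kernel exports an interface that lets users control it, which hands the problem over to the user.
This interface is /proc/[pid]/oom_adj. Its range is -17 to +15; the higher the value, the more likely the process is to be killed, and a value of -17 means the OOM killer will never consider killing it. Since Linux 2.6.36, oom_adj is deprecated in favor of /proc/[pid]/oom_score_adj, which ranges from -1000 to +1000.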
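
As an illustration, the value can be written from a script. The sketch below assumes a kernel that still exposes oom_adj; the helper names validate_oom_adj and set_oom_adj and the proc_root parameter (which exists only so the function can be tried against a scratch directory) are hypothetical.

```python
import os

OOM_ADJ_MIN = -17  # -17 exempts the process from the OOM killer
OOM_ADJ_MAX = 15   # the most likely victim


def validate_oom_adj(value):
    """Return value as an int, or raise ValueError if outside -17..+15."""
    value = int(value)
    if not OOM_ADJ_MIN <= value <= OOM_ADJ_MAX:
        raise ValueError("oom_adj must be in [-17, 15], got %d" % value)
    return value


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write value to <proc_root>/<pid>/oom_adj and return the path written.

    Lowering the value of a process usually requires elevated privileges.
    """
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write("%d\n" % validate_oom_adj(value))
    return path
```

For example, set_oom_adj(1865, -17), run as root, would protect the sshd process from the earlier example, assuming it still has pid 1865.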

4. How the OOM killer chooses a process

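The OOM killer chooses its victim based on a badness score, which is exposed in /proc/[pid]/oom_score. The principle is to kill as few processes as possible while freeing as much memory as possible, without killing innocent processes that merely happen to use a lot of memory. In the classic heuristic (kernels before 2.6.36), the badness score takes into account the memory size of the process, its CPU time (user time + system time), its run time, and its oom_adj value. The more memory a process consumes, the higher its score; the longer it has been running, the lower its score.
The strategy the OOM killer uses to choose the victim process is roughly as follows: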

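It must own a large number of page frames
Killing it loses only a small amount of work
Its static priority must be low (nice can be used to give unimportant processes a low priority)
It must not have root privileges
It must not access hardware directly
It must not be process 0 (swapper), process 1 (init), or a kernel thread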
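
The scores the kernel currently assigns can be inspected by reading oom_score for every process. The following is a rough sketch; the function name read_oom_scores and the proc_root parameter are assumptions made for illustration, and the result only reflects the moment the files were read.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return [(score, pid), ...] with the most likely victim first."""
    scores = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                scores.append((int(f.read().strip()), int(entry)))
        except (OSError, ValueError):
            continue  # the process exited meanwhile or the file is unreadable
    # Highest score first; ties are ordered by ascending pid.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return scores
```

On a Linux system, read_oom_scores()[:5] lists the five processes the OOM killer would currently consider first.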

5. OOM killer related configuration options

/proc/sys/vm/oom_dump_tasks (since Linux 2.6.25)

Enables a system-wide task dump (excluding kernel threads) to be produced when the kernel performs an OOM-killing. The dump includes the following information for each task (thread,
process): thread ID, real user ID, thread group ID (process ID), virtual memory size, resident set size, the CPU that the task is scheduled on, oom_adj score (see the description of
/proc/[pid]/oom_adj), and command name. This is helpful to determine why the OOM-killer was invoked and to identify the rogue task that caused it.
If this contains the value zero, this information is suppressed. On very large systems with thousands of tasks, it may not be feasible to dump the memory state information for each one.
Such systems should not be forced to incur a performance penalty in OOM situations when the information may not be desired.
If this is set to nonzero, this information is shown whenever the OOM-killer actually kills a memory-hogging task.
The default value is 0.

/proc/sys/vm/oom_kill_allocating_task (since Linux 2.6.24)

This enables or disables killing the OOM-triggering task in out-of-memory situations.
If this is set to zero, the OOM-killer will scan through the entire tasklist and select a task based on heuristics to kill. This normally selects a rogue memory-hogging task that frees
up a large amount of memory when killed.
If this is set to nonzero, the OOM-killer simply kills the task that triggered the out-of-memory condition. This avoids a possibly expensive tasklist scan.
If /proc/sys/vm/panic_on_oom is nonzero, it takes precedence over whatever value is used in /proc/sys/vm/oom_kill_allocating_task.
The default value is 0.

/proc/sys/vm/overcommit_kbytes (since Linux 3.14)

This writable file provides an alternative to /proc/sys/vm/overcommit_ratio for controlling the CommitLimit when /proc/sys/vm/overcommit_memory has the value 2. It allows the amount of
memory overcommitting to be specified as an absolute value (in kB), rather than as a percentage, as is done with overcommit_ratio. This allows for finer-grained control of CommitLimit
on systems with extremely large memory sizes.
Only one of overcommit_kbytes or overcommit_ratio can have an effect: if overcommit_kbytes has a nonzero value, then it is used to calculate CommitLimit, otherwise overcommit_ratio is
used. Writing a value to either of these files causes the value in the other file to be set to zero.

/proc/sys/vm/overcommit_memory

This file contains the kernel virtual memory accounting mode. Values are:
0: heuristic overcommit (this is the default)
1: always overcommit, never check
2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed". Under Linux 2.4, any nonzero
value implies mode 1.
In mode 2 (available since Linux 2.6), the total virtual address space that can be allocated (CommitLimit in /proc/meminfo) is calculated as
CommitLimit = (total_RAM - total_huge_TLB) * overcommit_ratio / 100 + total_swap
where:
total_RAM is the total amount of RAM on the system;
total_huge_TLB is the amount of memory set aside for huge pages;
overcommit_ratio is the value in /proc/sys/vm/overcommit_ratio; and
total_swap is the amount of swap space.
For example, on a system with 16GB of physical RAM, 16GB of swap, no space dedicated to huge pages, and an overcommit_ratio of 50, this formula yields a CommitLimit of 24GB.
Since Linux 3.14, if the value in /proc/sys/vm/overcommit_kbytes is nonzero, then CommitLimit is instead calculated as:
CommitLimit = overcommit_kbytes + total_swap
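
The two formulas can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name commit_limit_kb and its parameter names are invented here, all sizes are taken in kB, and integer division is a simplification rather than a statement about how the kernel rounds.

```python
def commit_limit_kb(total_ram_kb, total_swap_kb, total_huge_tlb_kb=0,
                    overcommit_ratio=50, overcommit_kbytes=0):
    """Compute CommitLimit in kB using the two formulas quoted above."""
    if overcommit_kbytes:
        # A nonzero overcommit_kbytes takes the place of the ratio-based term.
        return overcommit_kbytes + total_swap_kb
    usable_kb = total_ram_kb - total_huge_tlb_kb
    return usable_kb * overcommit_ratio // 100 + total_swap_kb


GB = 1024 * 1024  # kB per GB

# The worked example from the text: 16GB RAM, 16GB swap, no huge pages,
# overcommit_ratio of 50.
print(commit_limit_kb(16 * GB, 16 * GB) // GB)  # 24
```

With overcommit_kbytes set to the equivalent of 4GB, the same machine would get 4GB + 16GB = 20GB.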

/proc/sys/vm/overcommit_ratio (since Linux 2.6.0)

This writable file defines a percentage by which memory can be overcommitted. The default value in the file is 50. See the description of /proc/sys/vm/overcommit_memory.

/proc/sys/vm/panic_on_oom (since Linux 2.6.18)

This enables or disables a kernel panic in an out-of-memory situation.
If this file is set to the value 0, the kernel's OOM-killer will kill some rogue process. Usually, the OOM-killer is able to kill a rogue process and the system will survive.
If this file is set to the value 1, then the kernel normally panics when out-of-memory happens. However, if a process limits allocations to certain nodes using memory policies (mbind(2) MPOL_BIND) or cpusets (cpuset(7)) and those nodes reach memory exhaustion status, one process may be killed by the OOM-killer. No panic occurs in this case: because other nodes' memory may be free, this means the system as a whole may not have reached an out-of-memory situation yet.
If this file is set to the value 2, the kernel always panics when an out-of-memory condition occurs.
The default value is 0. 1 and 2 are for failover of clustering. Select either according to your policy of failover.
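
To see how a given machine is configured, the files described in this section can be read in one pass. This is only a sketch: the name read_oom_sysctls and the vm_dir parameter are hypothetical, and a file missing on older kernels is reported as None rather than treated as an error.

```python
import os

# The files under /proc/sys/vm described in this section.
OOM_SYSCTLS = (
    "oom_dump_tasks",
    "oom_kill_allocating_task",
    "overcommit_kbytes",
    "overcommit_memory",
    "overcommit_ratio",
    "panic_on_oom",
)


def read_oom_sysctls(vm_dir="/proc/sys/vm"):
    """Return {name: int value, or None if the file cannot be read}."""
    settings = {}
    for name in OOM_SYSCTLS:
        try:
            with open(os.path.join(vm_dir, name)) as f:
                settings[name] = int(f.read().strip())
        except (OSError, ValueError):
            settings[name] = None  # file absent on this kernel or unreadable
    return settings
```

Printing the returned dictionary before and after a tuning change is a quick way to confirm that the change took effect.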

References:

http://mogu.io/159-159
The proc(5) man page