In this article, we will take a look at the Linux kernel overcommit settings and their relation to Tanzu Greenplum (GPDB) memory configuration.
Definition of Terms
Virtual Memory: The sum of all the RAM and SWAP in a given system. When speaking about memory in the context of this article, we are referring to Virtual Memory.
Overcommit: Allocating more memory than the available Virtual Memory.
Allocate: In the context of memory management, an allocation of memory can be considered a "promise" that the memory is available. The actual physical memory is not assigned until it is needed. This assignment is done at the page level. When a process first touches a new page (normally 4 KB on x86_64 Linux), the system triggers a page fault and assigns physical memory to that page.
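As a rough illustration of page-level assignment, the sketch below reserves an anonymous mapping with Python's mmap module and then writes to only a few of its pages. On Linux, physical memory is typically assigned only to the pages that are actually touched. The function names and sizes are illustrative assumptions, not part of GPDB or the kernel.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the operating system


def pages_needed(num_bytes, page_size=PAGE_SIZE):
    """Return how many whole pages are required to hold num_bytes."""
    return -(-num_bytes // page_size)  # ceiling division


def touch_pages(mapping, count, page_size=PAGE_SIZE):
    """Write one byte into each of the first `count` pages of the mapping.

    Each first write to an untouched page is what causes a page fault,
    at which point the kernel backs that page with physical memory.
    """
    for i in range(count):
        mapping[i * page_size] = 1
    return count


if __name__ == "__main__":
    size = 64 * PAGE_SIZE
    region = mmap.mmap(-1, size)   # anonymous mapping: the "promise"
    touched = touch_pages(region, 4)  # only these pages need physical memory
    print(f"reserved {pages_needed(size)} pages, touched {touched}")
    region.close()
```

The gap between the reserved and touched page counts is the memory that has been promised but not yet assigned.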
Why Allow Overcommits?
The Linux virtual memory implementation uses several tactics to optimize the amount of memory used (one such strategy is called "Copy on Write" and is used when forking child processes). As a result, less memory is often actually used than is reported via the /proc file system (and, by extension, ps).
For this reason, minor overcommits are acceptable, as there is normally enough memory available to service them. However, this approach can result in memory being allocated when, in truth, not enough is free.
To handle this case, Linux supports several different overcommit strategies, selected by an integer value for the vm.overcommit_memory setting.
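To check which strategy a host is using, the setting can be read from /proc/sys/vm/overcommit_memory, where Linux exposes it as vm.overcommit_memory. The following is a minimal sketch; the helper names and the short mode descriptions are our own labels, not kernel terminology.

```python
from pathlib import Path

# Short descriptions of the three strategies discussed in this article.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting (no overcommit)",
}


def parse_overcommit_mode(text):
    """Parse the contents of the overcommit_memory file into an int mode."""
    mode = int(text.strip())
    if mode not in OVERCOMMIT_MODES:
        raise ValueError(f"unexpected overcommit mode: {mode}")
    return mode


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the current overcommit strategy from the given /proc path."""
    return parse_overcommit_mode(Path(path).read_text())


if __name__ == "__main__":
    mode = read_overcommit_mode()
    print(f"vm.overcommit_memory = {mode}: {OVERCOMMIT_MODES[mode]}")
```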
Overcommit Strategy 0
This is the default strategy that Linux uses. In this case, all of the virtual memory is available to the system for allocations, and all allocations are granted unless they appear to require a significant overcommit.
If, when a page fault occurs, there is not enough memory available (i.e. we have an overcommit), the system invokes the "Out of Memory Killer" (OOM Killer). The OOM Killer selects a process currently running on the system and terminates it, using a set of heuristics to choose the process. Note that it is usually not possible to predict when the OOM Killer will be invoked, nor which process it will select for termination.
Overcommit Strategy 1
This strategy is normally reserved for systems running processes that allocate very large, sparsely populated arrays. In this mode, *any* allocation succeeds. If an overcommit is later detected, the process that detects it generates a memory error and fails catastrophically (no cleanup; the process simply stops).
Please note that, because memory is not assigned until it is needed, the process that fails is not necessarily the one that has allocated the most memory. Due to the nature of memory usage, it is not possible to predict which process will fail because of a memory overcommit.
Overcommit Strategy 2
This mode is *required* for GPDB and HDB. In this mode, Linux performs strict memory accounting and grants an allocation only if the required memory is actually available. Because this check is done at the time of allocation, the program requesting the memory can handle the failure gracefully (in the case of GPDB, by generating an "Out of Memory" error and cleaning up the session that encountered it).
This strategy also reserves a portion of the physical RAM strictly for kernel use. The size of the reservation is controlled by the setting vm.overcommit_ratio, which gives the percentage of RAM that programs may allocate. This means the amount of virtual memory available for programs is actually:
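The benefit of failing at allocation time can be sketched in Python, where a refused allocation surfaces as a MemoryError that the caller can catch. This is only an analogy for the pattern described above, not GPDB's actual implementation; the function name and the injectable allocator parameter are hypothetical and exist so the failure path can be exercised safely.

```python
def allocate_or_report(num_bytes, allocator=bytearray):
    """Try to allocate num_bytes and report failure instead of crashing.

    Returns a (buffer, error_message) pair. Exactly one of the two is None.
    """
    try:
        return allocator(num_bytes), None
    except MemoryError:
        # The request was refused up front, so the caller still has a
        # chance to release resources and end the session cleanly.
        return None, f"Out of memory: could not allocate {num_bytes} bytes"


if __name__ == "__main__":
    buffer, error = allocate_or_report(1024)
    if error is None:
        print(f"allocated {len(buffer)} bytes")
    else:
        print(error)
```

Under strategies 0 and 1, by contrast, the allocation call itself would normally succeed and the failure would arrive later, when the pages are touched, leaving no opportunity for this kind of handling.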
SWAP + (RAM * (overcommit_ratio/100))
The reserved memory is used for things such as I/O buffers and system calls. Note that on a moderately sized system, we have observed the network buffers alone requiring more than 25 GB of memory at one time.
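The formula above can be worked through with a small helper. This is a simplified sketch: the function name is our own, the sizes in the example are made-up values rather than a sizing recommendation, and the calculation ignores details such as huge pages. On a running system the kernel reports its own figure as CommitLimit in /proc/meminfo.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Return SWAP + (RAM * (overcommit_ratio / 100)) in kB.

    Integer arithmetic is used, so the result is rounded down to a whole kB.
    """
    if not 0 <= overcommit_ratio <= 100:
        raise ValueError("this sketch expects overcommit_ratio between 0 and 100")
    return swap_kb + ram_kb * overcommit_ratio // 100


if __name__ == "__main__":
    # Hypothetical host: 256 GB RAM, 64 GB swap, vm.overcommit_ratio = 95.
    ram_kb = 256 * 1024 * 1024
    swap_kb = 64 * 1024 * 1024
    limit_kb = commit_limit_kb(ram_kb, swap_kb, 95)
    print(f"memory available to programs: {limit_kb / (1024 * 1024):.1f} GB")
```

The difference between RAM and the RAM term of the result is the portion held back for the kernel.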
Why Use Strategy 2?
The issue with both the default overcommit strategy (strategy 0) and strategy 1 is the unpredictable nature of the failure. In either case, there is no guarantee that the memory is available at the time of allocation, which results in unpredictable termination of processes. In particular, a failure under strategy 1 can corrupt data files or transaction logs, because a memory failure can occur mid-transaction and shut down the database process immediately, without any cleanup.