This article provides an overview of the memory tuneables that are required or recommended to adjust on a Greenplum Database (GPDB) cluster.
Please engage with your system administrator team to discuss best practices specific to your environment.
As discussed in the documentation, two tuneables related to virtual memory overcommitting must be adjusted:
vm.overcommit_memory = 2 (default is 0)
vm.overcommit_ratio = 95 (default is 50)
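With vm.overcommit_memory = 2, the kernel refuses allocations beyond a fixed commit limit of swap plus overcommit_ratio percent of RAM. The following sketch illustrates the arithmetic; the memory and swap sizes are assumed example values, not read from a live system:

```shell
# CommitLimit = SwapTotal + MemTotal * vm.overcommit_ratio / 100
mem_kb=268435456    # 256 GiB of RAM, in kB (assumed example)
swap_kb=4194304     # 4 GiB of swap, in kB (assumed example)
ratio=95            # vm.overcommit_ratio
commit_limit_kb=$(( swap_kb + mem_kb * ratio / 100 ))
echo "CommitLimit = ${commit_limit_kb} kB"
```

On a live system, the resulting limit is visible as the CommitLimit line in /proc/meminfo.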
Additionally, transparent huge pages must be disabled by adding transparent_hugepage=never to the kernel command line.
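A quick way to verify the current transparent huge page state is to read the standard sysfs interface; the output wording below is my own, not a Greenplum utility:

```shell
# Check whether transparent huge pages are currently disabled.
thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ] && grep -q '\[never\]' "$thp"; then
    echo "THP disabled"
elif [ -r "$thp" ]; then
    echo "THP enabled; add transparent_hugepage=never to the kernel command line"
else
    echo "THP interface not present on this kernel"
fi
```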
The following additional memory tuning is recommended:
vm.swappiness = 10 (default is 30)
vm.dirty_expire_centisecs = 500 (default is 3000)
vm.dirty_writeback_centisecs = 100 (default is 500)
On all systems, set vm.min_free_kbytes to 3% of system memory. The following command prints the value to set in /etc/sysctl.conf:
awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}' /proc/meminfo
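The same calculation can be demonstrated against a sample /proc/meminfo line; the MemTotal value here is an assumed example (roughly 128 GB), not a live reading:

```shell
echo "MemTotal:       131596992 kB" |
    awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print "vm.min_free_kbytes =", $2 * .03;}'
```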
For systems with 64GB of memory or less, the following tuning is recommended:
vm.dirty_background_ratio = 3 (default is 10)
vm.dirty_ratio = 10 (default is 30)
For systems with more than 64GB of memory, the following tuning is recommended:
vm.dirty_background_bytes = 1610612736 (1.5 GB; default is 0, as dirty_background_ratio is applied instead)
vm.dirty_bytes = 4294967296 (4 GB; default is 0, as dirty_ratio is applied instead)
vm.dirty_background_ratio = 0 (so that the background bytes setting takes effect)
vm.dirty_ratio = 0 (so that the bytes setting takes effect)
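Taken together, a /etc/sysctl.conf fragment for a system with more than 64GB of memory might look like the following sketch; adjust vm.min_free_kbytes to 3% of your actual system memory using the awk command given earlier:

```
# Greenplum memory tuning (example for a system with more than 64GB of memory)
vm.overcommit_memory = 2
vm.overcommit_ratio = 95
vm.swappiness = 10
vm.dirty_expire_centisecs = 500
vm.dirty_writeback_centisecs = 100
vm.dirty_background_bytes = 1610612736
vm.dirty_bytes = 4294967296
vm.dirty_background_ratio = 0
vm.dirty_ratio = 0
# vm.min_free_kbytes: set to 3% of system memory (see the awk command above)
```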
Apply all tuneables by adding the relevant lines to /etc/sysctl.conf, then load the changes with the command "sysctl -p" (as root) or by restarting the system. Applying these tuneables while the database is running is not recommended: stop the database, apply the tuneables, and then restart the database.
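The apply sequence might look like the following sketch; gpstop and gpstart are the standard Greenplum cluster utilities, run as the gpadmin user, and the sysctl lines are assumed to already be in /etc/sysctl.conf on every host:

```shell
gpstop -a          # stop the database cluster
sudo sysctl -p     # load the new kernel settings (repeat on every host)
gpstart -a         # restart the database cluster
```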
The vm.min_free_kbytes tuneable is set to 3% of system memory to ensure that get free page (GFP) calls are always successful. This has two benefits. First, performance: when a GFP call with a reclaim flag set cannot find free pages, the allocation enters memory reclaim and therefore takes longer. Second, correctness under pressure: a GFP_ATOMIC call simply fails if free pages are unavailable. This is the case, for example, when a socket buffer is created after a packet has been received; if the socket buffer cannot be allocated, the packet may be discarded.
The kernel does scale min_free_kbytes with system memory, but only up to a maximum of 64MB, which is far too small for the large-memory systems available today. See the comment on init_per_zone_wmark_min in mm/page_alloc.c for the scaling factor used.
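The kernel's auto-tuning, per the comment on init_per_zone_wmark_min, is roughly the square root of 16 times low memory in kB, clamped to the range 128 to 65536 kB. The sketch below compares that value with the 3% recommendation for an assumed 256 GiB system:

```shell
awk 'BEGIN {
    mem_kb = 268435456                # 256 GiB, in kB (assumed example)
    v = int(sqrt(mem_kb * 16))        # kernel auto-tuned min_free_kbytes
    if (v > 65536) v = 65536          # the 64MB cap discussed above
    if (v < 128)   v = 128
    printf "kernel default: %d kB, 3%% recommendation: %d kB\n", v, mem_kb * 0.03
}'
```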
On Greenplum systems, a min_free_kbytes value that is too small usually first appears as a performance problem on smaller-memory systems, around 100GB of system memory. If the value is left at the default maximum of 64MB on large or very large memory systems (greater than 500GB), critical issues tend to manifest: network failures leading to highly variable and poor performance, segment failures, and database hangs.
For an overview of the overcommit memory tuneables, please see the guide "Linux Overcommit strategies and Pivotal Greenplum (GPDB) / Pivotal HDB (HDB)".
Use the Greenplum memory calculator to obtain memory tuning settings for the overcommit_ratio value, as well as the gp_vmem_protect_limit.
The swappiness tuneable is lowered so that the system avoids swapping to disk, which is resource intensive. An algorithm determines the operating system's tendency to swap at any given memory pressure; setting the tuneable to 10 should be sufficient to avoid swapping in all but the most extreme memory pressure scenarios.
The goal of the additional tuneables (dirty writeback and dirty ratio/bytes) is to encourage more active memory management, pushing dirty pages to disk more frequently and in smaller, more manageable increments.
Since GPDB does not use direct IO, the Linux kernel manages writing data to disk. When the database determines that a page should be written to disk, the memory address is marked "dirty". The dirty writeback centisecs tuneables determine how often a kernel worker thread is started to write dirty data to disk. Additionally, the dirty ratio/bytes tuneables determine how much memory must be marked "dirty" before more extensive action is taken to flush dirty pages. Please see the documentation on kernel.org for a more detailed discussion of these tuneables.
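The state these tuneables control can be observed directly on a running system; Dirty and Writeback in /proc/meminfo show how much memory is waiting to be flushed:

```shell
# Observe the current dirty-page state (standard procfs location on Linux).
grep -E '^(Dirty|Writeback):' /proc/meminfo || echo "/proc/meminfo not available"
```

Watching these values during a heavy load (for example, with watch) shows whether dirty memory stays in the small, frequently flushed increments the tuning aims for.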
If dirty pages are not flushed to disk in a timely manner, two problems arise. First, because dirty pages are not reclaimable, memory pressure can be artificially increased, which may manifest elsewhere as poor performance. Second, if the database is rapidly reading and writing data concurrently (for example, in a CTAS statement), the combination of disk load and high memory pressure can lead to database and OS hangs that may not resolve for minutes or hours. For this reason, the dirty thresholds should be set to very low values on Greenplum systems, so that the database quickly hits the IO bottleneck without causing memory pressure, and performance remains predictable and stable.
For additional information, please read the following kernel document pages on kernel.org:
Sysctl-vm (an overview of all tuneables in /proc/sys/vm)
For further details on kernel free memory accounting, see this discussion on GFP allocation
See the calculation and maximum value for min_free_kbytes here in the kernel source