Here are a few symptoms to check to confirm the nature of the issue:
+ Queries are not progressing, and cancellation attempts do not complete, from the end-user perspective.
+ Database connections are not being established successfully.
+ gpstop is not progressing.
+ gpssh to all or some segment hosts is not responding.
+ Load is excessively high on all segment hosts (a quick cluster-wide check is sketched after this list).
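A minimal sketch for confirming the gpssh and load symptoms across the cluster (the hostfile path and host name are assumptions; adjust them for your environment):

# Run uptime on every host through gpssh; a hang here reproduces the gpssh symptom,
# while high load averages in the output confirm the load symptom.
gpssh -f /home/gpadmin/hostfile_all -e 'uptime'

# If gpssh itself hangs, check an individual segment host directly over ssh.
ssh sdw1 'uptime; free -g'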
All Greenplum versions.
The congestion and slowness were concluded to be a side effect of CPU time being consumed by excessive swap usage.
+ Stuck spinlocks are reported in the logs:
2024-06-13 21:30:35.569294 UTC,"read_only","coredw",p28261,th-1391740800,"xx.xx.0.11","40302",2024-06-13 20:30:54 UTC,0,con368574,cmd44,seg224,slice239,,,sx1,"PANIC","XX000","stuck spinlock (0x7f55981a300c) detected at instrument.c:398 (s_lock.c:42)",,,,,,,0,,"s_lock.c",42," Stack trace:
1 0xc015e7 postgres errstart (elog.c:557)
2 0xc0447e postgres elog_finish (elog.c:1728)
3 0xa8787e postgres <symbol not found> (s_lock.c:41)
4 0x8dd972 postgres <symbol not found> (discriminator 1)
5 0xc420e2 postgres <symbol not found> (discriminator 3)
6 0xc41ff0 postgres <symbol not found> (discriminator 3)
7 0xc427be postgres ResourceOwnerRelease (discriminator 2)
8 0x732b8c postgres <symbol not found> (xact.c:3365)
9 0x735455 postgres AbortCurrentTransaction (xact.c:3982)
10 0xa993b0 postgres PostgresMain (postgres.c:5069)
11 0x6b3553 postgres <symbol not found> (postmaster.c:4492)
12 0xa1ecb6 postgres PostmasterMain (postmaster.c:1517)
13 0x6b7431 postgres main (main.c:205)
14 0x7f55a9a67555 libc.so.6 __libc_start_main + 0xf5
15 0x6c32ac postgres <symbol not found> + 0x6c32ac
+ CPU saturation with high system (%sys) and I/O wait (%iowait) usage:
CPU %usr %nice %sys %iowait %steal %irq %soft %guest %gnice %idle
09:30:40 PM all 0.34 0.00 82.75 16.20 0.00 0.00 0.69 0.00 0.00 0.02
09:31:26 PM all 0.37 0.00 0.90 97.82 0.00 0.00 0.91 0.00 0.00 0.00
+ Excessive swap usage during congestion:
kbswpfree kbswpused %swpused kbswpcad %swpcad
09:30:40 PM 250979304 34575112 12.11 276084 0.80
09:31:26 PM 253616356 31938060 11.18 392280 1.23
09:32:26 PM 254341568 31212848 10.93 579800 1.86
09:33:26 PM 255950616 29603800 10.37 638020 2.16
+ Swap usage during normal processing:
12:01:01 AM 275034356 10520060 3.68 75528 0.72
+ Swap paging metrics (pswpin/s, pswpout/s) show swap demand (the sar commands used to collect these samples are sketched below):
pswpin/s pswpout/s
08:32:02 PM 702.19 13874.26
:
09:30:40 PM 19.25 3329.45
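The CPU, swap-space, and swap-paging samples above can be collected with sar; a minimal sketch (the 10-second interval and sample count are arbitrary choices):

# CPU utilization: %usr, %sys, %iowait per sample
sar -u 10 6

# Swap-space utilization: kbswpfree, kbswpused, %swpused
sar -S 10 6

# Swap paging rates: pswpin/s, pswpout/s
sar -W 10 6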
1. Reduce Swap Usage
+ Change the kernel parameter vm.swappiness from 10 to 1. This helps the cluster reduce the overhead of swap processing, which in turn frees CPU cycles for user workloads.
+ Create a backup copy of the existing /etc/sysctl.conf file, then set vm.swappiness = 1 in the current file on the coordinator and all segment hosts in the cluster. A host restart is not required to apply the change (a sketch follows below).
For more detailed information, refer to Overview of memory tuning best practices for Greenplum Database.
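A minimal sketch of the change on a single host, to be repeated on the coordinator and every segment host (the backup file name is arbitrary, and the sed command assumes a vm.swappiness line already exists in the file):

# Back up the existing configuration before editing it
cp /etc/sysctl.conf /etc/sysctl.conf.bak

# Change the existing vm.swappiness entry from 10 to 1
sed -i 's/^vm.swappiness.*/vm.swappiness = 1/' /etc/sysctl.conf

# Apply the new value without restarting the host, then verify it
sysctl -p
sysctl vm.swappiness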
2. Optimize memory usage. For more details, refer to Resource Queues and Memory Management; a sketch of commands to review the current memory-related settings follows.
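One way to review the memory-related settings discussed in that article (a minimal sketch; the database name is an assumption, and gp_toolkit.gp_resqueue_status applies only when Resource Queues are in use):

# Per-segment memory protection limit and default per-query memory target
psql -d postgres -c 'SHOW gp_vmem_protect_limit;'
psql -d postgres -c 'SHOW statement_mem;'

# Current resource queue status, including memory limits
psql -d postgres -c 'SELECT * FROM gp_toolkit.gp_resqueue_status;'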
3. Check whether cgroups are enabled (a sketch of the checks follows this step). See the Premature swapping while there is still plenty of pagecache to be reclaimed KB on the Red Hat web site.
If Greenplum Database is using Resource Queues, disable cgroups completely on all hosts in the cluster: run "systemctl disable cgconfig" and reboot the hosts.
If cgroups v1 is used (cgroups v2 would be preferable):
+ Set "vm.force_cgroup_v2_swappiness = 1" in /etc/sysctl.conf.
+ Check that the kernel parameter "vm.swappiness = 10" is set in /etc/sysctl.conf. (It is possible to set this to 1 if 10 seems too high, but never set it to 0, as that causes issues with the OOM killer.)
If cgroups v2 is used:
+ Check that the kernel parameter "vm.force_cgroup_v2_swappiness = 0" is set.
+ Check that the kernel parameter "vm.swappiness = 1" is set.
If Greenplum Database is using Resource Groups, ensure cgroups are configured as described in the documentation.
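A minimal sketch for checking which cgroup version is mounted and the current values of the related kernel parameters (vm.force_cgroup_v2_swappiness is a RHEL-specific tunable and may not exist on every kernel):

# "cgroup2fs" indicates cgroups v2; "tmpfs" indicates cgroups v1
stat -fc %T /sys/fs/cgroup/

# Current values of the swappiness-related kernel parameters
sysctl vm.swappiness
sysctl vm.force_cgroup_v2_swappiness 2>/dev/null || echo "vm.force_cgroup_v2_swappiness is not available on this kernel"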