When running vSAN 6.7 builds from Update 3 P04 (17167734) up to but not including P05 (17700523), or vSAN 7.0 builds from U1c (17325551) up to but not including U2 (17630552)
AND one of the following is occurring:
The "Number of elements in commit tables" value is more than 100k and does not decrease over a period of X hours (refer to script 2 below)
OR
One or more of the following conditions match:
LSOM: LSOM_ThrowCongestionVOB:3429: Throttled: Virtual SAN node "HOSTNAME" maximum Memory congestion reached.
LSOM: LSOM_ThrowCongestionVOB:3429: Throttled: Virtual SAN node "HOSTNAME" maximum Memory congestion reached
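The first condition (a count above 100k that does not decrease over time) can be checked by saving the output of script 2 twice, some hours apart, and comparing the two snapshots. The following is a minimal sketch only: the function name commit_tables_stuck and the snapshot file arguments are hypothetical, and the input is assumed to have the "<uuid>/ Number of elements in commit tables:<count>" shape shown in the sample output in this article.

```shell
# Hypothetical helper (not part of vSAN): compare two saved snapshots of the
# commit-table query output and list disks whose count is above the limit in
# the later snapshot and has not decreased since the earlier one.
#   $1 = earlier snapshot file
#   $2 = later snapshot file
#   $3 = limit (defaults to 100000, the "100k" figure from this article)
commit_tables_stuck() {
    awk -F: -v limit="${3:-100000}" '
        # First file: remember the earlier count for each disk UUID.
        NR == FNR { split($1, a, "/"); before[a[1]] = $NF + 0; next }
        # Second file: report disks that are over the limit and not shrinking.
        {
            split($1, a, "/"); uuid = a[1]; now = $NF + 0
            if ((uuid in before) && now > limit && now >= before[uuid])
                printf "%s %d -> %d\n", uuid, before[uuid], now
        }
    ' "$1" "$2"
}
```

No output means that no disk present in both snapshots matched the condition.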
Script 1 :
while true; do echo "================================================"; date; for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done; for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by LLOG"|awk -F \: '{print $2}');plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by PLOG"|awk -F \: '{print $2}');llogGib=$(echo $llogTotal |awk '{print $1 / 1073741824}');plogGib=$(echo $plogTotal |awk '{print $1 / 1073741824}');allGibTotal=$(expr $llogTotal \+ $plogTotal|awk '{print $1 / 1073741824}');echo $ssd;echo " LLOG consumption: $llogGib";echo " PLOG consumption: $plogGib";echo " Total log consumption: $allGibTotal";done;sleep 30; done ;
Sample output from script-1:
Fri Feb 12 06:40:51 UTC 2021
529dd4dc-xxxx-xxxx-xxxx-xxxxxxxxxxxx
memCongestion:0 >> When the issue is present, this value is higher than 0 (range 0-250)
slabCongestion:0
ssdCongestion:0
iopsCongestion:0
logCongestion:0
compCongestion:0
memCongestionLocalMax:0
slabCongestionLocalMax:0
ssdCongestionLocalMax:0
iopsCongestionLocalMax:0
logCongestionLocalMax:0
compCongestionLocalMax:0
529dd4dc-xxxx-xxxx-xxxx-xxxxxxxxxxxxxx
LLOG consumption: 0.270882
PLOG consumption: 0.632553
Total log consumption: 0.903435
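When script 1 runs for a long time, the output can be filtered so that only disk groups with a non-zero memCongestion value are shown. The sketch below is illustrative only: the function name nonzero_mem_congestion is hypothetical, and it assumes the output layout shown in the sample above (a UUID line followed by "name:value" lines).

```shell
# Hypothetical filter (not part of vSAN): read script 1 output on stdin and
# print each disk group UUID whose memCongestion value is greater than 0.
nonzero_mem_congestion() {
    awk '
        # A disk group UUID line consists only of hex digits and dashes.
        /^[0-9a-fA-F-]+$/ && length($0) >= 36 { uuid = $0; next }
        # Match memCongestion only, not memCongestionLocalMax.
        /^[ \t]*memCongestion:/ {
            split($0, kv, ":"); value = kv[2] + 0
            if (value > 0) print uuid, "memCongestion=" value
        }
    '
}
```

For example, the output of script 1 could be redirected to a file and then passed to the filter with: nonzero_mem_congestion < saved_output.txt (the file name is a placeholder).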
Script 2 :
vsish -e ls /vmkModules/lsom/disks/ 2>/dev/null | while read d ; do echo -n ${d/\//} ; vsish -e get /vmkModules/lsom/disks/${d}WBQStats | grep "Number of elements in commit tables" ; done | grep -v ":0$"
Sample output for two DiskGroups on a host (please verify that lines returned match all cache disks, and you may ignore any capacity disks that may be listed):
for i in $(seq 1 20000) ;
do date ;
echo -e "======== \e[0m" ;
for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do
llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by LLOG"|awk -F \: '{print $2}');
plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "Log space consumed by PLOG"|awk -F \: '{print $2}');
llogGib=$(echo $llogTotal |awk '{print $1 / 1073741824}');plogGib=$(echo $plogTotal |awk '{print $1 / 1073741824}');
allGibTotal=$(expr $llogTotal \+ $plogTotal|awk '{print $1 / 1073741824}');echo $ssd;
echo -e "\e[1;33m LLOG consumption: $llogGib";
echo -e "\e[1;33m PLOG consumption: $plogGib";
echo -e "\e[1;33m Total log consumption: $allGibTotal" ;done ;
sleep 10 ;
echo -e "======== \e[0m" ;
date ;
for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);
do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion ;done ;
sleep 10 ;
echo -e "======== \e[0m" ;
for ssd in $(localcli vsan storage list|grep "Group UUID" |sort -u|awk '{print $5}');
do echo $ssd;
consumptionTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info|grep "space consumed by"|awk -F \: '{sum+=$2}END{printf("%.0f\n", sum);}');
devName=$(vsish -e get /vmkModules/plog/devices_by_uuid/$ssd/info|grep "Disk Name"|awk -F \: '{print $2}');
dgConsumption=$(vsish -e get /vmkModules/plog/devices/$devName/elevStats|grep "for diskgroup"|awk -F \: '{print $2}');
zeroTotal=`echo $(($dgConsumption - $consumptionTotal))`;isNeg=$(echo $zeroTotal|grep ^\-);if [[ "$isNeg" != "" ]];
then zeroTotal="NA";zeroTotalGib="NA";else zeroTotalGib=$(echo $zeroTotal|awk '{printf("%.2f\n", $1 / 1024 / 1024 / 1024)}');fi;
echo -e "\t Total elevator data: $dgConsumption";echo -e "\t Total LLOG/PLOG data: $consumptionTotal";
echo -e "\t Total zero data: $zeroTotal ($zeroTotalGib GiB)";done
sleep 20 ;
echo -e "######### \e[0m" ;
done ;
Scrubber configuration values were modified in vSAN 6.7 P04 and vSAN 7.0 U1 P02 releases to scrub objects at a higher frequency. This results in persisting scrubber progress of each object more frequently than before. If there are idle objects in the cluster, then commit table entries for these objects created by the scrubber will accumulate at LSOM. Eventually, the accumulation will lead to LSOM memory congestion.
Idle objects in this context include unassociated objects, objects of powered-off VMs, replicated objects, etc.
This specific cause of memory congestion is seen on the following vSphere/vSAN releases:
ESXi 6.7 Update 3 P04 Build : 17167734
ESXi 6.7 Update 3 EP18 Build: 17499825
ESXi 7.0 Update 1d Build : 17551050
ESXi 7.0 Update 1c Build : 17325551
Read and follow the workaround section carefully.
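The affected build ranges quoted in this article can be expressed as a simple check. This is a sketch under stated assumptions: the function name is_affected_build is hypothetical, and the version and build number are passed in by hand (on an ESXi host they are normally reported by "vmware -v").

```shell
# Hypothetical helper (not part of ESXi): returns 0 when the version/build pair
# falls inside the affected ranges quoted in this article, non-zero otherwise.
#   $1 = ESXi version, e.g. 6.7.0 or 7.0.1
#   $2 = build number, e.g. 17167734
is_affected_build() {
    case "$1" in
        # 6.7: from Update 3 P04 (17167734) up to but not including P05 (17700523)
        6.7*) [ "$2" -ge 17167734 ] && [ "$2" -lt 17700523 ] ;;
        # 7.0: from U1c (17325551) up to but not including U2 (17630552)
        7.0*) [ "$2" -ge 17325551 ] && [ "$2" -lt 17630552 ] ;;
        *)    return 1 ;;
    esac
}

# Example usage:
#   is_affected_build 6.7.0 17499825 && echo "affected build"
```

The check covers only the scrubber-related cause described here, not the other causes of memory congestion seen on older 6.7 builds.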
Additionally, high memory congestion has been noted on the following builds due to other underlying causes. These are resolved by upgrading to 6.7 P05; no other resolution is available for these builds:
ESXi 6.7 Update 3 EP15: Build 16316930
ESXi 6.7 Update 3 P03 Build: 16713306
ESXi 6.7 Update 3 EP16 Build: 16773714
ESXi 6.7 Update 3 EP17 Build: 17098360
The VMware Engineering team is aware of this issue and has released a fix in vSAN 6.7 P05 and vSAN 7.0 U2 GA.
NOTE : It is recommended to apply the following configuration changes proactively, even if the customer is not yet seeing LSOM memory congestion.
# esxcfg-advcfg -s 1 /VSAN/ObjectScrubsPerYear
# esxcfg-advcfg -s 0 /VSAN/ObjectScrubPersistMin
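After applying the changes, the values can be read back with "esxcfg-advcfg -g" to confirm they took effect. The sketch below is an illustration only: the function name advcfg_value_is is hypothetical, and it assumes the command prints a line of the form "Value of <Option> is <N>", which should be verified on the host.

```shell
# Hypothetical helper (not part of ESXi): read the output of
# `esxcfg-advcfg -g <option>` on stdin and succeed only when the reported
# value equals $1.
# Assumption: the output looks like "Value of ObjectScrubsPerYear is 1".
advcfg_value_is() {
    awk -v want="$1" '
        / is / { got = $NF; found = 1 }
        END { exit !(found && got == want) }
    '
}

# Example usage on a host:
#   esxcfg-advcfg -g /VSAN/ObjectScrubsPerYear   | advcfg_value_is 1 && echo OK
#   esxcfg-advcfg -g /VSAN/ObjectScrubPersistMin | advcfg_value_is 0 && echo OK
```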
To remediate hosts that have already hit the high memory congestion issue, it is recommended to query "Number of elements in commit tables" and then perform the steps below:
# vsish -e ls /vmkModules/lsom/disks/ 2>/dev/null | while read d ; do echo -n ${d/\//} ; vsish -e get /vmkModules/lsom/disks/${d}WBQStats | grep "Number of elements in commit tables" ; done | grep -v ":0$"
Sample output for two DiskGroups on a host:
52f395f3-03fd-f005-bf02-40287362403b/ Number of elements in commit tables:300891
526709f4-8790-8a91-2151-a491e2d3aec5/ Number of elements in commit tables:289371
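Output like the sample above can be filtered against the 100k threshold mentioned in this article. This is a minimal sketch: the function name over_commit_table_limit is hypothetical, and the input is assumed to have exactly the shape shown in the sample.

```shell
# Hypothetical filter (not part of vSAN): read the commit-table query output on
# stdin and print the UUID and count of every disk above the limit.
#   $1 = limit (defaults to 100000, the "100k" figure from this article)
over_commit_table_limit() {
    awk -F: -v limit="${1:-100000}" '
        /Number of elements in commit tables/ {
            split($1, a, "/")          # a[1] is the disk UUID before the slash
            count = $NF + 0            # the count is the last colon-separated field
            if (count > limit) print a[1], count
        }
    '
}
```

The query command can be piped straight into the filter; any disk it prints is above the threshold.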
# esxcfg-advcfg -s 1 /VSAN/ObjectScrubsPerYear
# esxcfg-advcfg -s 0 /VSAN/ObjectScrubPersistMin
Performance degradation due to high LSOM Memory congestion caused by high commit table entries.