vSAN -- vSAN nodes experience LSOM Memory congestion due to high "Number of elements in commit tables"

Article ID: 318125

Updated On:

Products

VMware vSAN, VMware vSAN 7.x, VMware vSAN 6.x

Issue/Introduction

Affected: Specific vSAN 6.7 and vSAN 7.x Builds (see Build Section for details).

Symptoms:

The "Number of elements in commit tables" value is greater than 100K and does not decrease over a period of X hours (see Script 2 below)

AND/OR

One or more of the following applies:

  • You may see vSAN Skyline Health Alarm "vSAN Health Service - Physical Disk Health – Congestion" showing Memory Congestion for one or more vSAN Host(s)
  • Memory Congestion Alarm / Congestion issue
  • All or some VMs may show as inaccessible in vCenter
  • All or multiple vSAN Hosts may become unresponsive
  • You may not be able to see the Files and/or Folders on vSAN Datastore when one or more vSAN Cluster nodes experience LSOM Memory Congestion
  • Severe Performance Degradation
  • Application Performance is down
  • On any of the vSAN Hosts, the logs show one or more of the following messages:
 

DOM: DOM2PCPrintDescriptor:1797: [105568173:0x4313fe8f3718] => Stuck descriptor

LSOM: LSOM_ThrowCongestionVOB:3429: Throttled: Virtual SAN node "HOSTNAME" maximum Memory congestion reached.

LSOM_ThrowAsyncCongestionVOB:1669: LSOM Memory Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.
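
The three log signatures above can be searched for in a single pass. The following is a minimal sketch, not an official tool: the function name count_lsom_signatures is hypothetical, and the log path in the usage comment assumes the messages are written to the vmkernel log.

```shell
# Hypothetical helper: count the log lines that match any of the three
# signatures quoted above. Reads log text from stdin and prints the count.
count_lsom_signatures() {
  # grep -c exits non-zero when the count is 0, so "|| true" keeps the
  # helper usable in scripts that stop on errors.
  grep -c -E 'Stuck descriptor|maximum Memory congestion reached|LSOM Memory Congestion State: Exceeded' || true
}

# Example (assumption: the messages land in the vmkernel log):
# count_lsom_signatures < /var/log/vmkernel.log
```

A result above 0 only indicates that at least one of the messages was logged; the congestion values themselves are checked with Script 1.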

 

 

Script 1: Verify existing LSOM Memory Congestion on all vSAN Hosts:

while true; do
  echo "================================================"
  date
  # Congestion counters for each Disk Group
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    echo $ssd
    vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion
  done
  # LLOG and PLOG log space for each Disk Group, converted from bytes to GiB
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F \: '{print $2}')
    plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F \: '{print $2}')
    llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
    plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
    allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
    echo $ssd
    echo " LLOG consumption: $llogGib"
    echo " PLOG consumption: $plogGib"
    echo " Total log consumption: $allGibTotal"
  done
  sleep 30   # repeat every 30 seconds; stop with Ctrl+C
done

Sample Output

529dd4dc-####-####-####-###############
   memCongestion:### >> This value will be higher than 0
   slabCongestion:0
   ssdCongestion:0
   iopsCongestion:0
   logCongestion:0
   compCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:0
   iopsCongestionLocalMax:0
   logCongestionLocalMax:0
   compCongestionLocalMax:0 
529dd4dc-####-####-####-###############
 LLOG consumption: 0.270882
 PLOG consumption: 0.632553
 Total log consumption: 0.903435
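
When Script 1 output has been captured to a file, the Disk Groups with non-zero memCongestion can be picked out automatically. This is a sketch under assumptions: the function name mem_congested_groups is hypothetical, and it expects the layout shown in the sample output (a UUID line followed by indented counter lines of the form name:value).

```shell
# Hypothetical helper: read Script 1 output on stdin and print every
# Disk Group UUID whose memCongestion counter is above 0, with the value.
mem_congested_groups() {
  awk -F: '
    # A line that starts with a hex UUID names the Disk Group that follows.
    /^[0-9a-f]+-/ { uuid = $1 }
    # Match memCongestion exactly, not memCongestionLocalMax.
    $1 ~ /^[ \t]*memCongestion$/ && $2 + 0 > 0 { print uuid, $2 + 0 }
  '
}

# Example (script1.out is a hypothetical capture of Script 1 output):
# mem_congested_groups < script1.out
```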

 

 

Script 2 -- Verify current values of "Number of elements in commit tables":

vsish -e ls /vmkModules/lsom/disks/ 2>/dev/null | while read d ; do echo -n ${d/\//} ; vsish -e get /vmkModules/lsom/disks/${d}WBQStats | grep "Number of elements in commit tables" ; done | grep -v ":0$"

Sample output for two Disk Groups on a Host (verify that the lines returned cover all Cache disks; ignore any Capacity disks):

529395f3-####-####-####-###############/   Number of elements in commit tables:300891    >> Disk Group affected ( = Value > 100K )
526709f4-####-####-####-###############/   Number of elements in commit tables:289371    >> Disk Group affected ( = Value > 100K )
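
To apply the 100K threshold to Script 2 output and order the affected Disk Groups from the highest value to the lowest, a small filter can be used. This is a sketch only: the function name affected_disk_groups is hypothetical, and it assumes each line carries the Disk Group UUID first and the counter after the last colon, as in the sample output above.

```shell
# Hypothetical helper: read Script 2 output on stdin and print
# "<uuid> <value>" for every Disk Group above the threshold
# (default 100000), sorted with the highest value first.
affected_disk_groups() {
  awk -F: -v limit="${1:-100000}" '
    /Number of elements in commit tables/ {
      # The UUID is the first token; a trailing "/" may or may not be present.
      split($1, parts, "[ \t/]+")
      if ($NF + 0 > limit) print parts[1], $NF + 0
    }
  ' | sort -k2,2nr
}

# Example: pipe the output of Script 2 into the helper.
# <Script 2 command> | affected_disk_groups
```

The resulting order matches the order in which the Resolution section asks for the Hosts/Disk Groups to be handled.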

 

Environment

This specific cause of LSOM Memory Congestion is observed on the following vSphere/vSAN Releases:
 
vSAN 6.7.x:
 
--->  From vSAN 6.7 U3 P04 ( 17167734 ) up to, but not including, 6.7 U3 P05 ( 17700523 ), which covers:
ESXi 6.7 Update 3 P04: 17167734
ESXi 6.7 Update 3 EP18: 17499825
 
 
vSAN 7.0.x:
--->  From 7.0 U1c ( 17325551 ) up to, but not including, 7.0 U2 GA ( 17630552 ), which covers:
ESXi 7.0 Update 1c ( 17325551 )
ESXi 7.0 Update 1d ( 17551050 )
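
The build ranges above can be turned into a quick check. The sketch below is illustrative: the function name is_affected_build is hypothetical, and comparing build numbers as a numeric range is a simplification. The explicit build list above remains the reference.

```shell
# Hypothetical helper: exit 0 if the given ESXi build number falls into one
# of the affected ranges listed above, exit 1 otherwise.
# Usage: is_affected_build <6.7|7.0> <build number>
is_affected_build() {
  case "$1" in
    6.7) [ "$2" -ge 17167734 ] && [ "$2" -lt 17700523 ] ;;  # 6.7 U3 P04 up to, not including, P05
    7.0) [ "$2" -ge 17325551 ] && [ "$2" -lt 17630552 ] ;;  # 7.0 U1c up to, not including, U2 GA
    *)   return 1 ;;
  esac
}

# Example: is_affected_build 7.0 17551050 && echo "affected build"
```

On an ESXi host, the release and build number to pass in can be read from the output of `vmware -v`.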
 



Cause

High LSOM Memory Congestion caused by high Commit Table entries.
 
Scrubber configuration values were modified in vSAN 6.7 U3 P04 ( 17167734 ) and vSAN 7.0 U1c ( 17325551 ) releases to scrub vSAN Objects at a higher frequency.
This results in persisting scrubber progress of each vSAN Object more frequently than before.
 
If there are idle vSAN Objects in the Cluster, the commit table entries that the scrubber creates for these vSAN Objects accumulate in LSOM.
Eventually, this accumulation leads to LSOM Memory Congestion. ( Idle vSAN Objects in this context are, for example, unassociated vSAN Objects, vSAN Objects of powered-off VMs, or replicated vSAN Objects. )

Resolution

Fixes are available in:

  • vSAN 6.7 U3 P05 ( 17700523 )
  • vSAN 7.0 U2 GA ( 17630552 )
 


If any of the vSAN Hosts in the Cluster shows "Number of elements in commit tables" > 100K:

NOTE:
The steps below also apply when no LSOM Memory Congestion has been observed yet, provided the vSAN Hosts are on the affected Builds listed above
and any of their Disk Groups shows "Number of elements in commit tables" > 100K (as outlined in section "Issue/Introduction").
 
Perform the following steps in descending order of that value (= start with the Hosts/Disk Groups showing the highest values and work down to the lowest).
 
1.) Preparation: 
Put one Host in Maintenance Mode with "Ensure Accessibility" (only if the Host has to be rebooted)
and/or unmount and remount its Disk Groups by logging in to the Host via SSH/PuTTY and executing:
 
esxcli vsan storage diskgroup unmount -s ############
esxcli vsan storage diskgroup mount -s ############
 
( Where ############ stands for the Cache Tier device of the Disk Group, e.g. naa.xxxx, eui.xxxx, t10.NVMexxxxx )
 
 
2.) Reboot the Host if needed (depending on its responsiveness)
3.) Take the Host out of Maintenance Mode
4.) Repeat Steps 1-3 for the next vSAN Host
 
 
5.) Execute the following commands on all vSAN Hosts:
 
5.1) Change the scrubber frequency to once per year (remove the leading "#" before running):
# esxcfg-advcfg -s 1 /VSAN/ObjectScrubsPerYear
 
5.2) Disable the scrubber persist timer (remove the leading "#" before running):

# esxcfg-advcfg -s 0 /VSAN/ObjectScrubPersistMin
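
After both values have been set, they can be read back with esxcfg-advcfg -g to confirm the change on each Host. The helper below is a sketch: the function name advcfg_value_is is hypothetical, and it assumes that the get command prints a single line of the form "Value of <Option> is <N>".

```shell
# Hypothetical helper: read one line of "esxcfg-advcfg -g" output on stdin
# and succeed only if the reported value equals the expected one.
# Assumption: the line has the form "Value of <Option> is <N>".
advcfg_value_is() {
  awk -v want="$1" '{ got = $NF } END { exit !(NR > 0 && got == want) }'
}

# Example on an ESXi host:
# esxcfg-advcfg -g /VSAN/ObjectScrubsPerYear   | advcfg_value_is 1 && echo "ObjectScrubsPerYear OK"
# esxcfg-advcfg -g /VSAN/ObjectScrubPersistMin | advcfg_value_is 0 && echo "ObjectScrubPersistMin OK"
```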

 

If assistance is required, please open a Ticket with VMware by Broadcom Support.

Additional Information

  • NOTE: Please be cautious while performing any Maintenance Mode tasks on vSAN Hosts running vSAN 7.0 U1 P02 (= 7.0 U1c 17325551 ) - Reference
  • High LSOM Memory Congestion has also been noted on older 6.7 Builds for other underlying causes, which are resolved after upgrading to 6.7 Update 3 P05 ( 17702396 )

Attachments

configure-dom-scrubber-frequency