vSAN nodes experience LSOM Memory congestion due to high "Number of elements in commit tables" (vSAN 6.7 U3/vSAN 7.0 U1)

Article ID: 318125

Updated On:

Products

VMware vSAN VMware vSAN 7.x VMware vSAN 6.x

Issue/Introduction

Affected: Specific vSAN 6.7 and vSAN 7.x Builds (see Environment Section for details).

Symptoms:

The "Number of elements in commit tables" value is greater than 100K and does not decrease over a period of X hours (see Script 2 below)

AND/OR

One or more of the following applies:

  • You may see vSAN Health Service - Physical Disk Health – Congestion showing Memory Congestion for one or more vSAN Host(s)
  • Memory Congestion Alarm / Congestion issue
  • All or some VMs may show as inaccessible in vCenter
  • All or multiple vSAN Hosts may become unresponsive
  • You may not be able to see the Files and/or Folders on vSAN Datastore when one or more vSAN Cluster nodes experience LSOM Memory Congestion
  • Severe Performance Degradation
  • Application Performance is down
  • On any of the vSAN Hosts, the logs show one or more of the following messages:

DOM: DOM2PCPrintDescriptor:1797: [105###173:0x4313f###3718] => Stuck descriptor

LSOM: LSOM_ThrowCongestionVOB:3429: Throttled: Virtual SAN node "HOSTNAME" maximum Memory congestion reached.

LSOM_ThrowAsyncCongestionVOB:1669: LSOM Memory Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 204.

Script 1: Verify existing LSOM Memory Congestion on all vSAN Hosts:

while true; do
  echo "================================================"
  date
  # Congestion counters for each Disk Group
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    echo $ssd
    vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion
  done
  # LLOG/PLOG log space consumption for each Disk Group, divided by 1073741824 (GiB)
  for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
    llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F \: '{print $2}')
    plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F \: '{print $2}')
    llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
    plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
    allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
    echo $ssd
    echo " LLOG consumption: $llogGib"
    echo " PLOG consumption: $plogGib"
    echo " Total log consumption: $allGibTotal"
  done
  # Repeat every 30 seconds until interrupted with Ctrl+C
  sleep 30
done

Sample Output

529dd4dc-####-####-####-###############
   memCongestion:### >> This value will be higher than 0
   slabCongestion:0
   ssdCongestion:0
   iopsCongestion:0
   logCongestion:0
   compCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:0
   iopsCongestionLocalMax:0
   logCongestionLocalMax:0
   compCongestionLocalMax:0 
529dd4dc-####-####-####-###############
  LLOG consumption: 0.270882
  PLOG consumption: 0.632553
  Total log consumption: 0.903435
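
When Script 1 runs for a while, the non-zero congestion counters are easy to miss among the zero ones. The following is a minimal sketch, not part of the original article, of a filter that reads saved Script 1 output on standard input and prints only the counters with a value above 0. The function name nonzero_congestion is hypothetical, and the sketch assumes the output layout shown in the sample above (a Disk Group UUID line followed by indented name:value lines).

```shell
# Hypothetical helper (not from the original article): reads Script 1 output
# on stdin and prints "<disk group UUID> <counter> <value>" for every
# congestion counter whose value is greater than 0.
# Assumes the layout of the sample output: a UUID line, then "name:value" lines.
nonzero_congestion() {
  awk -F: '
    # A line consisting only of a UUID starts a new Disk Group section
    /^[0-9a-fA-F]+(-[0-9a-fA-F]+)+[ \t]*$/ {
      uuid = $0
      sub(/[ \t]+$/, "", uuid)
      next
    }
    # Counter lines look like "   memCongestion:37"
    /Congestion/ {
      name = $1
      gsub(/[ \t]/, "", name)
      value = $2 + 0
      if (value > 0) print uuid, name, value
    }'
}
```

For example, the output of one Script 1 iteration could be saved to a file and then passed through the filter with nonzero_congestion < saved_output.txt (the file name is only an illustration). An empty result means no congestion counter was above 0 in that snapshot.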

Script 2: Verify current values of "Number of elements in commit tables":

vsish -e ls /vmkModules/lsom/disks/ 2>/dev/null | while read d ; do echo -n ${d/\//} ; vsish -e get /vmkModules/lsom/disks/${d}WBQStats | grep "Number of elements in commit tables" ; done | grep -v ":0$"

Sample output for two Disk Groups on a Host (verify that the lines returned correspond to the Cache disks and ignore any Capacity disks):

529395f3-####-####-####-###############/   Number of elements in commit tables:300891    >> Disk Group affected ( = Value > 100K )
526709f4-####-####-####-###############/   Number of elements in commit tables:289371    >> Disk Group affected ( = Value > 100K )
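
Because the remediation is meant to proceed from the highest values to the lowest, it can help to filter and sort the Script 2 output. The following is a minimal sketch, not part of the original article, that reads Script 2 output on standard input, keeps only entries above a threshold (100000 by default, matching the 100K limit used in this article) and sorts them in descending order. The function name affected_disk_groups is hypothetical.

```shell
# Hypothetical helper (not from the original article): reads Script 2 output
# on stdin and prints "<count> <disk group UUID>" for every entry above the
# threshold, highest count first.
# Usage: <Script 2> | affected_disk_groups [threshold]
affected_disk_groups() {
  threshold=${1:-100000}
  awk -F: -v t="$threshold" '
    /Number of elements in commit tables/ {
      # The UUID is the first whitespace-separated token on the line
      split($1, f, /[ \t]+/)
      uuid = f[1]
      sub(/\/$/, "", uuid)      # drop a trailing "/" if present
      n = $NF + 0               # the count follows the last ":"
      if (n > t) print n, uuid
    }' | sort -rn
}
```

The first line of the result is then the Disk Group to handle first on that Host. Running the same pipeline again some hours later and comparing the numbers shows whether the values are decreasing.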

 

Environment

This specific cause of LSOM Memory Congestion is observed on the following vSphere/vSAN Releases:

vSAN 6.7.x:
From vSAN 6.7 U3 P04 ( 17167734 ) up to, but not including, 6.7 U3 P05 ( 17700523 ), containing:
ESXi 6.7 Update 3 P04 ( 17167734 )
ESXi 6.7 Update 3 EP18 ( 17499825 )

vSAN 7.0.x:
From 7.0 U1c ( 17325551 ) up to, but not including, 7.0 U2 GA ( 17630552 ), containing:
ESXi 7.0 Update 1c ( 17325551 )
ESXi 7.0 Update 1d ( 17551050 )
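
The build ranges above can be expressed as a simple check. The following is a minimal sketch, not part of the original article, of a function that takes an ESXi version string and a build number and reports through its exit status whether the pair falls into one of the affected ranges. The function name is_affected_build is hypothetical, and the comment showing how to feed it from `vmware -v` assumes that command prints a line of the form "VMware ESXi 7.0.1 build-17325551".

```shell
# Hypothetical helper (not from the original article).
# Usage: is_affected_build <version> <build>
# Returns 0 if the build lies in one of the affected ranges listed in the
# Environment section, non-zero otherwise.
is_affected_build() {
  version=$1
  build=$2
  case "$version" in
    # vSAN 6.7.x: from 6.7 U3 P04 (17167734) and before 6.7 U3 P05 (17700523)
    6.7*) [ "$build" -ge 17167734 ] && [ "$build" -lt 17700523 ] ;;
    # vSAN 7.0.x: from 7.0 U1c (17325551) and before 7.0 U2 GA (17630552)
    7.0*) [ "$build" -ge 17325551 ] && [ "$build" -lt 17630552 ] ;;
    # Any other release is outside the scope of this article
    *) return 1 ;;
  esac
}

# Possible use on a Host, assuming `vmware -v` prints
# "VMware ESXi <version> build-<number>":
#   set -- $(vmware -v | sed 's/build-//')
#   is_affected_build "$3" "$4" && echo "Build is in an affected range"
```

The version argument is needed because the 6.7 and 7.0 build number ranges overlap, so the build number alone does not identify the release line.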

Cause

LSOM Memory Congestion is caused by a high number of Commit Table entries.
 
Scrubber configuration values were modified in vSAN 6.7 U3 P04 ( 17167734 ) and vSAN 7.0 U1c ( 17325551 ) releases to scrub vSAN Objects at a higher frequency.
This results in persisting scrubber progress of each vSAN Object more frequently than before.
 
If there are idle vSAN Objects in the Cluster, the commit table entries created by the scrubber for these vSAN Objects accumulate at LSOM.
Eventually, this accumulation leads to LSOM Memory Congestion. (Idle vSAN Objects in this context are, for example, unassociated vSAN Objects, powered-off VMs, replicated vSAN Objects, etc.)

Resolution

Fixes are available in:

  • vSAN 6.7 U3 P05 ( 17700523 )
  • vSAN 7.0 U2 GA ( 17630552 )

If any of the vSAN Hosts in the Cluster shows "Number of elements in commit tables" > 100K:

NOTE:
Apply these steps even if no LSOM Memory Congestion has been observed, provided the vSAN Hosts are on the affected Builds mentioned above
and any of their Disk Groups shows "Number of elements in commit tables" > 100K (as outlined in section "Issue/Introduction").
Perform the following steps in descending order (from the Hosts/Disk Groups with the highest values to those with the lowest):
  1. Preparation:
    Put one Host into Maintenance Mode with "Ensure Accessibility" (only if the Host has to be rebooted)
    and/or unmount and remount its Disk Groups by logging in to the Host via SSH/PuTTY and executing:
    esxcli vsan storage diskgroup unmount -s ############
    esxcli vsan storage diskgroup mount -s ############
    (where ############ stands for the Cache Tier device managing the Disk Group, e.g. naa.xxxx, eui.xxxx, t10.NVMexxxxx)
  2. Reboot the Host if needed (depending on its responsiveness)
  3. Take the Host out of Maintenance Mode
  4. Repeat Steps 1-3 for the next vSAN Host
  5. Execute the following commands on all vSAN Hosts:
    1. Change the scrubber frequency to once per year (remove the leading "#" to run):
      # esxcfg-advcfg -s 1 /VSAN/ObjectScrubsPerYear
    2. Disable the scrubber persist timer (remove the leading "#" to run):
      # esxcfg-advcfg -s 0 /VSAN/ObjectScrubPersistMin
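
The two settings from Step 5 can be wrapped in a small function so that the exact commands can be reviewed before they are executed on each Host. The following is a minimal sketch, not part of the original article. The function name apply_scrubber_workaround and the DRY_RUN variable are hypothetical. By default the function only prints the commands, and it runs them only when DRY_RUN=0 is set.

```shell
# Hypothetical wrapper (not from the original article) around the two
# esxcfg-advcfg commands from Step 5.
# DRY_RUN=1 (default): only print the commands.
# DRY_RUN=0: execute them on the Host.
apply_scrubber_workaround() {
  for setting in "1 /VSAN/ObjectScrubsPerYear" "0 /VSAN/ObjectScrubPersistMin"; do
    # Split "<value> <option path>" into $1 and $2
    set -- $setting
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "esxcfg-advcfg -s $1 $2"
    else
      esxcfg-advcfg -s "$1" "$2"
    fi
  done
}
```

Calling apply_scrubber_workaround prints the two commands. Calling it as DRY_RUN=0 apply_scrubber_workaround applies them. Afterwards, the current value of each option can be read back with esxcfg-advcfg -g followed by the option path.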

If assistance is required, please open a support case as described in "Creating and managing Broadcom support cases".

Additional Information

Attachments

configure-dom-scrubber-frequency