vSAN LSOM Elevator stopped causing high SSD/Log Congestion

Article ID: 326650

Products

VMware vSAN

Issue/Introduction

Symptoms:

The following symptoms can be seen:

- The hosts are running ESXi 7.0 Update 1 or later
- 'SSD Congestion' alarms in Skyline Health point to one or several disk groups in the cluster
- The 'ssdCongestion' / 'logCongestion' values keep increasing when running the GSS congestion check one-liner:

Example:

# for ssd in $(localcli vsan storage list |grep "Group UUID"|awk '{print $5}'|sort -u);do echo $ssd;vsish -e get /vmkModules/lsom/disks/$ssd/info|grep Congestion;done

Tue Jul 20 19:56:18 UTC 2021
########-####-####-####-########5409
   memCongestion:0
   slabCongestion:0
   ssdCongestion:227 <------- This is already too high. Even a value below 100 that keeps incrementing is enough to suspect this issue.
   iopsCongestion:0
   logCongestion:0 <------- In some cases logCongestion increases while no ssdCongestion is present.
   compCongestion:0
   mdCongestion:0
   memCongestionLocalMax:0
   slabCongestionLocalMax:0
   ssdCongestionLocalMax:227
   iopsCongestionLocalMax:0
   logCongestionLocalMax:0
   compCongestionLocalMax:0
   mdCongestionLocalMax:0
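
To make output like the above easier to scan across many disk groups, the counters of interest can be filtered out. The following is a minimal sketch, not a VMware-provided tool: the helper name report_congestion is hypothetical, and it only assumes the "name:value" line format shown in the example output.

```shell
# report_congestion: hypothetical helper, not part of ESXi.
# Reads "name:value" lines (as printed by the vsish command above) on stdin,
# prints any non-zero ssdCongestion / logCongestion counter, and returns
# exit status 1 if at least one of them was non-zero.
report_congestion() {
    awk -F: '
        # Drop leading whitespace so "   ssdCongestion:227" matches by name.
        { sub(/^[ \t]+/, "", $1) }
        # "$2 + 0" keeps only the leading number and ignores trailing notes.
        ($1 == "ssdCongestion" || $1 == "logCongestion") && ($2 + 0) > 0 {
            print $1 "=" ($2 + 0)
            found = 1
        }
        END { exit(found ? 1 : 0) }
    '
}
```

It could be used as "vsish -e get /vmkModules/lsom/disks/$ssd/info | report_congestion", where $ssd is a disk group UUID as in the one-liner above. A single non-zero value is only a hint. As noted in the example, the stronger signal is a value that keeps incrementing between samples.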

On the host that owns the affected disk group, under 'Host → Monitor → vSAN → Performance → Disks → Diskgroup → ', the "Write Buffer Free Percentage" is below 70% and the "Cache Disk De-stage Rate" metric shows no throughput.

Environment

VMware vSAN 7.0.x

Cause

Due to an underflow of the outstanding IO counter, the vSAN elevator believes that the capacity device still has outstanding IO to be de-staged and waits for it to complete before de-staging the next data. However, no IOs are actually pending on the capacity disk, so the elevator never de-stages any data.
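
The effect of such an underflow can be illustrated with a toy model. This is a sketch for illustration only and not vSAN's actual implementation: the 32-bit counter width, the function names, and the spurious extra decrement that triggers the underflow are all assumptions.

```shell
# Toy model only. Counter width, names, and the trigger are assumptions.
MASK=4294967295   # 2^32 - 1, emulates unsigned 32-bit wrap-around

outstanding=0     # hypothetical count of IOs in flight to the capacity disk

io_issued()    { outstanding=$(( (outstanding + 1) & MASK )); }
io_completed() { outstanding=$(( (outstanding - 1) & MASK )); }

# The modelled elevator de-stages only when it believes nothing is in flight.
elevator_can_destage() { [ "$outstanding" -eq 0 ]; }

io_issued
io_completed
io_completed      # assumed spurious extra decrement: the counter underflows

echo "outstanding=$outstanding"
if elevator_can_destage; then
    echo "elevator: de-staging"
else
    echo "elevator: waiting on IO that does not exist"
fi
```

On a shell with 64-bit arithmetic, the counter wraps to 4294967295 instead of going below zero, so the model keeps waiting even though no IO is actually pending.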

Resolution

This issue is fixed in vSAN 7.0 U3g (EP5). Update to this build or newer to resolve it.
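
To check whether a host already runs a given build or newer, the build number reported by "esxcli system version get" can be compared against the build number of the fixed release. The following is a minimal sketch: the helper name build_at_least is hypothetical, and the minimum build number is deliberately left as a parameter that has to be taken from the release notes of the fixed build.

```shell
# build_at_least MIN_BUILD
# Hypothetical helper. Reads "esxcli system version get" output on stdin and
# succeeds when the numeric build is greater than or equal to MIN_BUILD.
build_at_least() {
    min_build=$1
    # The Build line looks like "   Build: Releasebuild-<number>".
    # Keep only the digits of that line.
    build=$(sed -n 's/^[[:space:]]*Build:[^0-9]*\([0-9][0-9]*\).*/\1/p')
    [ -n "$build" ] && [ "$build" -ge "$min_build" ]
}
```

It could be used as "esxcli system version get | build_at_least <fixed build number>", which returns success only if the host is at that build or newer.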

Additional Information

Impact/Risks:

Overall vSAN performance can be impacted if the buildup of PLOG consumption has already caused vSAN congestion.
VMs may start showing problems such as:
- Increased latency
- Switching to a "Read-Only" mode
- Guest OS getting stuck
