SSD congestion due to failing storage controller.


Article ID: 408799


Products

VMware vSAN

Issue/Introduction

SSD congestion is observed in a vSAN cluster, and all disks residing on the same storage controller show read/write failures.

SSD congestion was collected via the following one-liner:

while true; do
    echo "================================================"
    date
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        echo $ssd
        vsish -e get /vmkModules/lsom/disks/$ssd/info | grep Congestion
    done
    for ssd in $(localcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        llogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by LLOG" | awk -F \: '{print $2}')
        plogTotal=$(vsish -e get /vmkModules/lsom/disks/$ssd/info | grep "Log space consumed by PLOG" | awk -F \: '{print $2}')
        llogGib=$(echo $llogTotal | awk '{print $1 / 1073741824}')
        plogGib=$(echo $plogTotal | awk '{print $1 / 1073741824}')
        allGibTotal=$(expr $llogTotal \+ $plogTotal | awk '{print $1 / 1073741824}')
        echo $ssd
        echo "    LLOG consumption: $llogGib"
        echo "    PLOG consumption: $plogGib"
        echo "    Total log consumption: $allGibTotal"
    done
    sleep 30
done

Example 

 

esxcli storage core device stats get

naa.##########549700
   Device: naa.##########549700
   Successful Commands: 7541554
   Blocks Read: 35393757
   Blocks Written: 42388956
   Read Operations: 3797796
   Write Operations: 3741974
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 48887
   Failed Blocks Read: 58248736
   Failed Blocks Written: 71827488
   Failed Read Operations: 6646
   Failed Write Operations: 42234
   Failed Reserve Operations: 0

naa.##########5496e0
   Device: naa.##########5496e0
   Successful Commands: 7511603
   Blocks Read: 35130001
   Blocks Written: 42371440
   Read Operations: 3768446
   Write Operations: 3741371
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 48698
   Failed Blocks Read: 56475648
   Failed Blocks Written: 73730080
   Failed Read Operations: 6307
   Failed Write Operations: 42385
   Failed Reserve Operations: 0


naa.##########549710
   Device: naa.##########549710
   Successful Commands: 7557820
   Blocks Read: 35155268
   Blocks Written: 42740640
   Read Operations: 3768906
   Write Operations: 3787124
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 49338
   Failed Blocks Read: 58300928
   Failed Blocks Written: 71985248
   Failed Read Operations: 6812
   Failed Write Operations: 42518
   Failed Reserve Operations: 0

naa.##########5496f0
   Device: naa.##########5496f0
   Successful Commands: 14001031
   Blocks Read: 450457838
   Blocks Written: 41458280
   Read Operations: 10129220
   Write Operations: 3870029
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 89696
   Failed Blocks Read: 1296358928
   Failed Blocks Written: 57525248
   Failed Read Operations: 46238
   Failed Write Operations: 43445
   Failed Reserve Operations: 0

naa.##########53c789
   Device: naa.##########53c789
   Successful Commands: 23053319
   Blocks Read: 116976309
   Blocks Written: 343105728
   Read Operations: 5079311
   Write Operations: 17972076
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 247689
   Failed Blocks Read: 220614144
   Failed Blocks Written: 667578068
   Failed Read Operations: 29980
   Failed Write Operations: 217645
   Failed Reserve Operations: 0

naa.##########549770
   Device: naa.##########549770
   Successful Commands: 13566442
   Blocks Read: 424205607
   Blocks Written: 39037068
   Read Operations: 9999472
   Write Operations: 3565194
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 86038
   Failed Blocks Read: 1185489944
   Failed Blocks Written: 55110144
   Failed Read Operations: 43902
   Failed Write Operations: 42130
   Failed Reserve Operations: 0
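To confirm that the failures are controller-wide rather than isolated to one disk, the saved stats output above can be summarized per device. A minimal sketch, assuming the output was saved to a file named stats.txt (the file name and output format are illustrative, not from this article):

```shell
# Summarize Failed Commands per device from saved
# "esxcli storage core device stats get" output in stats.txt.
# On a failing controller, every attached device shows a
# similarly elevated failure percentage.
awk '
  /^   Device:/          { dev = $2 }
  /Successful Commands:/ { ok[dev] = $3 }
  /Failed Commands:/     { fail[dev] = $3 }
  END {
    for (d in ok) {
      pct = 100 * fail[d] / (ok[d] + fail[d])
      printf "%s  failed=%s (%.2f%%)\n", d, fail[d], pct
    }
  }
' stats.txt
```

In the example output above, every device behind the suspect controller reports roughly the same order of failed commands, which is the pattern that points at the controller rather than at individual drives.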

Environment

VMware vSAN (All Versions)

Cause

Read and write failures occurred on all disks presented to ESXi via the failing/failed storage controller. As a result, vSAN was not able to de-stage data to the impacted capacity drives correctly, causing the write buffers of the cache disks to fill, and SSD congestion was introduced to relieve the I/O bottleneck.

Congestion is a flow control mechanism used by vSAN. Whenever there is a bottleneck in a lower layer of vSAN (closer to the physical storage devices), vSAN uses this flow control (aka congestion) mechanism to relieve the bottleneck in the lower layer and instead reduce the rate of incoming I/O at the vSAN ingress (i.e., the vSAN client VMs). This reduction of the incoming rate is done by introducing an I/O delay at the ingress that is equivalent to the delay the I/O would have incurred due to the bottleneck at the lower layer. Thus, it is an effective way to shift latency from the lower layers to the ingress without changing the overall throughput of the system. vSAN measures congestion as a scalar value from 0 to 255, and the introduced delay is computed using a randomized exponential backoff method, based on the congestion metric.
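The relationship between the congestion metric and the injected ingress delay can be illustrated with a toy model. The base delay and scaling factor below are invented for the example; this is a sketch of randomized exponential backoff in general, not vSAN's internal formula:

```shell
# Toy model: map a congestion value (0-255) to an ingress I/O delay
# using randomized exponential backoff. base_us and the /32 scaling
# are hypothetical illustration values, NOT vSAN internals.
congestion=200   # example congestion metric (0-255)
awk -v c="$congestion" 'BEGIN {
  srand()
  base_us = 100                  # hypothetical base delay in microseconds
  cap = base_us * 2 ^ (c / 32)   # delay cap grows exponentially with congestion
  delay = cap * rand()           # randomized backoff: sample a delay below the cap
  printf "congestion=%d -> delay up to %.0f us (sampled: %.0f us)\n", c, cap, delay
}'
```

The point of the model is the shape of the curve: at congestion 0 the added delay is negligible, while values approaching 255 throttle ingress I/O aggressively, shifting latency to the client instead of letting the lower layer's buffers overflow.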

 

Resolution

Place the impacted host into Maintenance Mode using the "Ensure accessibility" option, and reach out to your hardware vendor to investigate the failed/failing storage controller.
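If the host is still reachable from the ESXi shell, maintenance mode with the "Ensure accessibility" evacuation mode can also be entered via esxcli. This invocation is a sketch; verify the option names against your ESXi build before use:

```shell
# Enter maintenance mode while keeping vSAN objects accessible
# (verify flag names on your ESXi version before running).
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility
```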

If you have any additional questions or concerns regarding this issue, please open a case with vSAN Support for further investigation.

Additional Information