SSD log buildup can cause poor performance in a VMware vSAN Cluster

Article ID: 326870

Products

VMware vSAN

Issue/Introduction

Symptoms:
Under certain rare circumstances, vSAN (formerly known as Virtual SAN) can fill the logging space on the SSD (cache tier). When this occurs, cluster performance degrades because the SSD cannot buffer inbound I/O in a timely manner.
 
If you encounter this issue, you may experience one or more of these symptoms:
  • Hosts periodically enter a Not Responding state in vCenter Server
  • Some virtual machines resident on vSAN exhibit extremely poor performance
  • Some virtual machines resident on vSAN fail to power on due to a timeout or I/O error
This issue can manifest in several ways. It is most commonly (though not exclusively) associated with messages reporting persistently high SSD congestion or frequent oscillations in SSD congestion. These messages are written to the vmkernel.log file on the ESXi vSAN host(s).

Note: SSD congestion messaging is not necessarily associated with the issue described in this document. The presence of SSD congestion messaging is not itself a guarantee that this issue has been encountered.
  • Persistently high SSD congestion messaging

    2015-10-21T07:05:09.294Z cpu5:33450)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD ########-####-####-####-########0eee congestion reached.
    2015-10-21T07:06:09.408Z cpu14:32817)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD ########-####-####-####-########0eee congestion reached.
    2015-10-21T07:07:09.491Z cpu13:33200)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node esxi-01.corp.local maximum SSD ########-####-####-####-########0eee congestion reached.

     
  • Oscillating SSD congestion messaging

    2015-10-20T05:55:15.773Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
    2015-10-20T05:55:15.775Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
    2015-10-20T05:55:15.776Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
    2015-10-20T05:55:15.813Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
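
To tell the two patterns above apart in a large vmkernel.log, a small script can count both kinds of message. The following is a minimal sketch, assuming the log lines follow the format of the samples in this article. The function name `summarize_congestion` and the regular expressions are illustrative and are not part of any VMware tool. As the note above says, a match does not by itself confirm this issue.

```python
import re
from collections import Counter

# Patterns modeled on the sample log lines in this article (assumed format).
MAX_CONGESTION = re.compile(
    r"LSOM_ThrowCongestionVOB:\d+: Throttled: .* maximum SSD (\S+) congestion reached"
)
STATE_CHANGE = re.compile(
    r"LSOM_ThrowAsyncCongestionVOB:\d+: LSOM SSD Congestion State: (Normal|Exceeded)\."
)


def summarize_congestion(lines):
    """Count maximum-congestion events per SSD and Normal/Exceeded state flips."""
    max_events = Counter()
    states = []
    for line in lines:
        match = MAX_CONGESTION.search(line)
        if match:
            max_events[match.group(1)] += 1  # keyed by the SSD UUID in the message
            continue
        match = STATE_CHANGE.search(line)
        if match:
            states.append(match.group(1))
    # A flip is any change between consecutive reported states.
    flips = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return {"max_congestion_events": dict(max_events), "state_flips": flips}
```

Passing an open file object for a copy of vmkernel.log as `lines` returns both counts. Many state flips within a short interval correspond to the oscillating pattern, while a steady stream of maximum-congestion events for one SSD corresponds to the persistent pattern.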

 

Environment

VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x

Cause

This issue occurs due to a buildup of data in the write buffer of a disk group. The buildup exhausts the buffer, which ultimately degrades I/O performance.
 
To prevent exhaustion of the vSAN write buffer of each disk group, the system gradually throttles back the rate of write operations as free buffer space is reduced. It does this by injecting gradually higher latency into the workloads' I/O operations. An adaptive algorithm prevents overreaction to transient workload spikes by slowly increasing the synthetic delay as the buffer continues to fill. Ultimately, the algorithm ensures that the rate of incoming write operations can be matched by the rate of de-staging data from the buffer to the capacity tier.
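
The general idea of fill-based throttling can be illustrated with a toy model. The following is a sketch of the concept only and is not vSAN's actual algorithm. Every name and constant in it (`start`, `max_delay_ms`, `alpha`, `base_latency_ms`, and the linear delay ramp) is a hypothetical choice made for illustration.

```python
def target_delay_ms(fill, start=0.3, max_delay_ms=50.0):
    """Hypothetical ramp: no delay below `start` fill, rising linearly to the maximum at full."""
    if fill <= start:
        return 0.0
    return max_delay_ms * min(1.0, (fill - start) / (1.0 - start))


def next_delay_ms(previous_ms, fill, alpha=0.2):
    """Move only a fraction `alpha` toward the target so a brief spike has limited effect."""
    return previous_ms + alpha * (target_delay_ms(fill) - previous_ms)


def simulate(steps, offered_mb, destage_mb, capacity_mb=1000.0, base_latency_ms=1.0):
    """Toy model: injected delay lowers the admitted write rate; returns (fill, delay_ms)."""
    used_mb, delay_ms = 0.0, 0.0
    for _ in range(steps):
        delay_ms = next_delay_ms(delay_ms, used_mb / capacity_mb)
        # Assumed relation: admitted rate scales inversely with total per-I/O latency.
        admitted_mb = offered_mb * base_latency_ms / (base_latency_ms + delay_ms)
        used_mb = min(capacity_mb, max(0.0, used_mb + admitted_mb - destage_mb))
    return used_mb / capacity_mb, delay_ms
```

In this model, `simulate(500, offered_mb=200.0, destage_mb=100.0)` keeps the buffer well below full with a small nonzero delay, because the delay grows until the admitted rate roughly matches the de-staging rate. When the offered load is below the de-staging rate, the buffer stays empty and no delay is injected.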
 
Under normal circumstances, this mechanism is effective in avoiding buffer exhaustion even for the most write-intensive workloads. However, when the log leak issue is encountered, a number of log records remain in the log (not de-staged) and thus inhibit the effectiveness of the algorithm. As available buffer space is exhausted, the algorithm permanently and aggressively throttles inbound workloads for the affected disk groups and their dependent objects. This permanent throttling causes the extreme performance degradation that is observed.

Resolution

If you encounter SSD congestion, engage VMware vSAN Support to determine the cause of the congestion and the appropriate resolution.