Under rare circumstances, vSAN (formerly known as Virtual SAN) can fill its SSD/cache-tier logging space. When this occurs, cluster performance is impacted because the SSD can no longer buffer inbound I/O in a timely manner.
If this issue is encountered, you may experience one or more of these symptoms:
Hosts periodically enter a Not Responding state in vCenter Server.
Some virtual machines resident on vSAN exhibit extremely poor performance.
Some virtual machines resident on vSAN may fail to power on due to a timeout or an I/O error.
This issue can manifest in several ways. It is most commonly (though not exclusively) associated with messages reporting persistently high SSD congestion, or with frequent oscillations between congestion states. These messages appear in the vmkernel.log file on the affected ESXi vSAN host(s), for example:
2015-10-21T07:05:09.294Z cpu5:33450)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node ######## maximum SSD ########-####-####-####-########0eee congestion reached.
2015-10-21T07:06:09.408Z cpu14:32817)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node ######## maximum SSD ########-####-####-####-########0eee congestion reached.
2015-10-21T07:07:09.491Z cpu13:33200)LSOM: LSOM_ThrowCongestionVOB:2912: Throttled: Virtual SAN node ######## maximum SSD ########-####-####-####-########0eee congestion reached.
2015-10-20T05:55:15.773Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
2015-10-20T05:55:15.775Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
2015-10-20T05:55:15.776Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
2015-10-20T05:55:15.813Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
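The oscillation pattern above (congestion flipping between Normal and Exceeded within milliseconds) is the signature to look for. As a minimal illustrative sketch, the snippet below parses log lines in the format shown and counts state transitions; the sample excerpt is embedded for demonstration, and the parsing logic is an assumption based only on the message format shown in this article:

```python
import re

# Sample vmkernel.log excerpt in the format shown above (illustrative only).
LOG = """\
2015-10-20T05:55:15.773Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
2015-10-20T05:55:15.775Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
2015-10-20T05:55:15.776Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
2015-10-20T05:55:15.813Z cpu34:33120)LSOM: LSOM_ThrowAsyncCongestionVOB:2127: LSOM SSD Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 255.
"""

# Matches the congestion-state message format shown in this article.
PATTERN = re.compile(
    r"LSOM SSD Congestion State: (?P<state>\w+)\. "
    r"Congestion Threshold: (?P<threshold>\d+) "
    r"Current Congestion: (?P<value>\d+)\."
)

def congestion_events(text):
    """Return (state, congestion_value) tuples for each congestion log line."""
    return [(m.group("state"), int(m.group("value")))
            for m in PATTERN.finditer(text)]

def count_oscillations(events):
    """Count Normal<->Exceeded state flips; frequent flips suggest this issue."""
    return sum(1 for prev, cur in zip(events, events[1:]) if prev[0] != cur[0])

events = congestion_events(LOG)
print(events)                       # four events alternating Normal/Exceeded
print(count_oscillations(events))   # 3 state transitions in this excerpt
```

In practice the same scan would be run against the full /var/log/vmkernel.log of each affected host; many transitions in a short window, or a congestion value pinned at 255, matches the symptom described above.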
VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x
This issue occurs due to a buildup of data in the write buffer of a disk group. The resulting buffer exhaustion ultimately degrades I/O performance.
To prevent exhaustion of each disk group's vSAN write buffer, the system gradually throttles the rate of write operations as free buffer space shrinks. It does this by injecting progressively higher latencies into the processing of workload I/O operations. An adaptive algorithm slowly increases this synthetic delay as the buffer continues to fill, which prevents overreaction to transient workload spikes.
Ultimately, the algorithm ensures that the rate of incoming write operations can be matched by the rate of de-staging data from the buffer to the capacity tier.
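The throttling behavior described above can be illustrated with a toy model. This is not vSAN's actual algorithm, and the threshold and delay values are invented for illustration; the point is only that the injected delay grows with buffer utilization, giving de-staging time to catch up:

```python
def injected_delay_ms(buffer_used_pct, start_pct=60.0, max_delay_ms=50.0):
    """Toy model of adaptive write throttling (illustrative only).

    Below start_pct buffer utilization, writes are not delayed. Above it,
    a synthetic delay grows with utilization, gradually matching the
    incoming write rate to the de-staging rate before the buffer fills.
    The parameter values here are made up; vSAN's real algorithm differs.
    """
    if buffer_used_pct <= start_pct:
        return 0.0
    # Scale the delay linearly across the remaining headroom.
    fill_fraction = (buffer_used_pct - start_pct) / (100.0 - start_pct)
    return min(max_delay_ms, max_delay_ms * fill_fraction)

for pct in (40, 60, 80, 95, 100):
    print(pct, injected_delay_ms(pct))  # 0.0, 0.0, 25.0, 43.75, 50.0
```

Under the log leak described below, leaked records keep utilization pinned near 100%, so a model like this stays at its maximum delay permanently, which corresponds to the sustained aggressive throttling observed.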
Under normal circumstances, this mechanism avoids buffer exhaustion even for the most write-intensive workloads. When the log leak issue is encountered, however, a number of log records remain in the log (they are never de-staged), which undermines the algorithm. As the available buffer space is exhausted, the algorithm permanently applies aggressive throttling to inbound workloads on the affected disk groups and their dependent objects. This permanent enforcement causes the extreme performance degradation that is observed.
Possible resolutions to alleviate the latency caused by SSD congestion:
Adjust the VM's I/O profile so that I/Os are 4K aligned, if they are not already; vSAN performance is degraded when I/Os are not 4K aligned.
Split large, I/O-intensive VMDKs attached to the affected VMs into multiple smaller VMDKs, so that more DOM queues can be assigned.
If the above does not resolve the issue, engage Broadcom vSAN Support to investigate further and determine the cause of the congestion and its resolution.
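Regarding the 4K-alignment recommendation above: an I/O is 4K aligned when both its starting offset and its length are multiples of 4096 bytes. A quick check, with hypothetical offsets chosen for illustration:

```python
ALIGNMENT = 4096  # 4K alignment boundary in bytes

def is_4k_aligned(offset, length):
    """True when both the starting offset and the length fall on 4K boundaries."""
    return offset % ALIGNMENT == 0 and length % ALIGNMENT == 0

print(is_4k_aligned(8192, 4096))  # True: offset and length are 4K multiples
print(is_4k_aligned(512, 4096))   # False: 512-byte start offset is misaligned
```

Misalignment typically originates in the guest (for example, a partition that starts at a non-4K offset), so the fix is applied inside the VM rather than in vSAN itself.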