vSAN performance diagnostics reports: "The increase in latency in the vSAN stack might be beyond expected limits"

Article ID: 326858


Updated On:

Products

VMware vSAN

Issue/Introduction

This article explains the vSAN performance diagnostics finding "The increase in latency in the vSAN stack might be beyond expected limits", why it appears, and the possible remedies for addressing it.


Symptoms:

In the vSAN performance diagnostics, you see this message:

The increase in latency in the vSAN stack might be beyond expected limits.


Environment

VMware vSAN 6.x
VMware vSAN 7.x
VMware vSAN 8.x

Cause

This message indicates that the latency observed at the virtual machine layer is much higher than the latency observed at the vSAN disk group layer. The detailed performance graphs show the latencies at both layers (VM layer and vSAN disk group layer) for each host in the cluster, with write and read latencies shown separately.

This finding is displayed only when performance diagnostics are run with the "latency" goal.
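
For intuition, the check amounts to a per-host comparison of the two layers. The following is a minimal, illustrative sketch in Python; the 2x ratio and 5 ms floor are hypothetical thresholds chosen for the example, not the values used by vSAN performance diagnostics, and the host names and samples are made up.

```
# Illustrative sketch only: flag hosts where VM-layer latency far exceeds
# disk-group-layer latency. The ratio and floor are hypothetical thresholds,
# not the ones used by vSAN performance diagnostics.
def latency_gap_suspicious(vm_layer_ms, disk_group_ms, ratio=2.0, floor_ms=5.0):
    """Return True when VM-layer latency is much higher than disk-group latency."""
    return vm_layer_ms > floor_ms and vm_layer_ms > ratio * disk_group_ms

# Hypothetical per-host samples (ms), e.g. average write latency at each layer.
samples = {
    "esxi-01": (12.0, 3.1),   # large gap -> investigate the layers above the disk group
    "esxi-02": (2.4, 1.9),    # small gap -> expected overhead
}
for host, (vm_ms, dg_ms) in samples.items():
    verdict = "gap beyond the sketch threshold" if latency_gap_suspicious(vm_ms, dg_ms) else "ok"
    print(f"{host}: VM layer {vm_ms} ms vs disk group {dg_ms} ms -> {verdict}")
```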

Resolution

Here is a list of possible remedies:
  1. Congestion:
    1. Most often, the increase in latency is correlated with congestion in one or more disk groups (KB 2150012). Congestion is a feedback mechanism that slows down the rate of incoming IO requests from the vSAN DOM client layer so that the rate of IO requests to the vSAN disk group matches the rate of IOs the disk group can service.
    2. When congestion is encountered, the vSAN DOM client layer queues more requests, which leads to higher latency at the VM layer. If the finding "vSAN is experiencing congestion in one or more disk group(s)" is not separately reported by performance diagnostics, you can check the congestion levels by going to vSAN Cluster > Performance > vSAN Backend and examining the congestion metric (a hedged API sketch for pulling this metric programmatically follows this list). Refer to KB 2150012 for how to address vSAN congestion.
  2. Network issues: A network issue, such as heavy packet loss, a large number of duplicate ACKs, or retransmissions in the network, can lead to a large number of retransmissions of IO requests in the vSAN stack. Many network issues are caused by misconfiguration of switch VLAN policies or NIC teaming policies. You can monitor whether packet losses are occurring in your network by investigating the following (a scripted per-host check is also sketched after this list):
    1. Go to vSAN cluster > Host > Performance > vSAN – Physical Adapters. For each physical adapter on a host, monitor the Network Adapter Packets Loss Rate. Repeat across all the hosts in the cluster. A packet loss rate greater than 0 means that the physical adapter is dropping packets. This might be due to incorrect firmware on the network adapter, an incorrect version of the device driver, or some misconfiguration in the settings of the switch to which this network adapter is attached. Please consult product and HCL documentation for details on how to address this issue.
    2. If none of the physical adapters show any loss rate, go to vSAN cluster > Host > Performance > vSAN – VMkernel adapters aggregation. Look for vSAN Host Packets Loss Rate. Any value greater than 0 indicates that vSAN is encountering packet losses. 
    3. Go to vSAN cluster > Host > Performance > vSAN – VMkernel adapters and check the VMkernel Network Adapter Packet Loss Rate for each VMkernel Adapter. If a non-zero packet loss is seen at this layer, but no network loss is seen at any of the physical adapters on any of the hosts, then it usually means that there is a packet loss over the end-to-end network. This could be due to some incorrect configuration at the network switch, or some congestion over the network.
    4. If the environment is a 2-node or Stretched Cluster configuration, please verify there are no connectivity issues between the vSAN nodes and the witness. See Troubleshooting vSAN Witness Node Isolation for more information.
  3. Erasure coding: It is possible that you have chosen Erasure Coding (RAID5/RAID6) in the storage policy. While Erasure Coding is a great feature for space savings, it leads to higher write latencies because of the read-modify-write semantics of RAID5/RAID6 (an approximation of the extra backend IOs is sketched after this list). If your primary goal is the best possible latency, consider avoiding Erasure Coding in your storage policy. Keep in mind that applying a new storage policy to existing VM disks triggers reconfiguration/recovery traffic; in that case, wait for the recovery traffic to complete before re-running the benchmark. You can observe the status of recovery operations under Cluster > vSAN > Resyncing Components in vCenter.
  4. Unaligned 4K access pattern: When write IOs are not aligned on 4K boundaries, they turn into read-modify-writes, because the affected 4K blocks must first be read and their contents modified before the write can be issued. This increases latency across the stack. Check whether your benchmark can issue 4K-aligned IOs and, if it can, configure it to do so (a simple alignment check is sketched after this list).
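
For item 1 (congestion), if you prefer to pull the backend congestion metric programmatically rather than through the UI, a rough sketch using the vSAN Management SDK for Python might look like the following. The entity type "cluster-domcompmgr" and the "congestion" label are assumptions to verify against the entity types and labels exposed by your vCenter build, and the vCenter hostname, credentials, and cluster name are placeholders.

```
# Rough sketch, assuming the vSAN Management SDK for Python (vsanmgmtObjects,
# vsanapiutils) is installed alongside pyVmomi. The entity type
# "cluster-domcompmgr" and the "congestion" label are assumptions; verify them
# against the performance entities exposed in your environment.
import ssl
from datetime import datetime, timedelta

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import vsanmgmtObjects  # noqa: F401  (registers the vim.cluster.* vSAN types)
import vsanapiutils

context = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=context)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "vSAN-Cluster")
    view.Destroy()

    # Newer SDK builds may also expect an explicit version= argument here.
    vc_mos = vsanapiutils.GetVsanVcMops(si._stub, context=context)
    perf_mgr = vc_mos["vsan-performance-manager"]

    end = datetime.utcnow()
    spec = vim.cluster.VsanPerfQuerySpec(
        entityRefId="cluster-domcompmgr:*",   # backend ("vSAN Backend") view, assumed
        labels=["congestion"],
        startTime=end - timedelta(hours=1),
        endTime=end)
    for entity in perf_mgr.VsanPerfQueryPerf(querySpecs=[spec], cluster=cluster):
        for series in entity.value:
            # "values" is a comma-separated string of samples over the query window.
            print(entity.entityRefId, series.metricId.label, series.values)
finally:
    Disconnect(si)
```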
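
For item 2 (network issues), the per-host dropped-packet counters can also be read with plain pyVmomi through the vCenter performance manager. The sketch below assumes an existing SmartConnect session (si); the counter names net.droppedRx.summation and net.droppedTx.summation and the 20-second real-time interval are assumptions to confirm against the counters available in your vCenter, and the cluster name is a placeholder.

```
# Minimal pyVmomi sketch: report dropped RX/TX packet counters for every host
# in a cluster. Counter names and the 20-second real-time interval are
# assumptions to confirm against the performance counters in your vCenter.
from pyVmomi import vim


def report_dropped_packets(si, cluster_name):
    content = si.RetrieveContent()
    perf = content.perfManager

    # Map "group.name.rollup" -> counter key.
    counters = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
                for c in perf.perfCounter}
    wanted = [counters[name] for name in ("net.droppedRx.summation",
                                          "net.droppedTx.summation")
              if name in counters]

    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == cluster_name)
    view.Destroy()

    for host in cluster.host:
        metric_ids = [vim.PerformanceManager.MetricId(counterId=key, instance="*")
                      for key in wanted]
        spec = vim.PerformanceManager.QuerySpec(entity=host, metricId=metric_ids,
                                                intervalId=20, maxSample=15)
        for result in perf.QueryPerf(querySpec=[spec]):
            for series in result.value:
                dropped = sum(v for v in series.value if v > 0)
                if dropped:
                    print(f"{host.name} {series.id.instance or 'aggregate'}: "
                          f"{dropped} dropped packets in the sampled window")
```

Any non-zero count should be correlated with the physical adapter, VMkernel adapter, and switch-side checks described above.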
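
For item 3 (erasure coding), the extra write latency comes from write amplification. The numbers below are the commonly cited approximation for small, partial-stripe writes, shown for intuition only; full-stripe writes and caching change the picture.

```
# Illustrative arithmetic only: approximate backend IOs per small
# (partial-stripe) guest write for each layout.
AMPLIFICATION = {
    "RAID1 (mirroring, FTT=1)": {"reads": 0, "writes": 2},  # write two replicas
    "RAID5 (erasure coding)":   {"reads": 2, "writes": 2},  # read old data + parity, write both back
    "RAID6 (erasure coding)":   {"reads": 3, "writes": 3},  # read old data + two parities, write all back
}
for layout, io in AMPLIFICATION.items():
    total = io["reads"] + io["writes"]
    print(f"{layout}: ~{io['reads']} reads + {io['writes']} writes "
          f"= {total} backend IOs per small guest write")
```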
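
For item 4 (unaligned access), the rule of thumb is that every IO's offset and length should be a multiple of 4 KiB. The sketch below only illustrates the rule; the offsets are hypothetical.

```
# Check whether an IO's offset and length sit on 4 KiB boundaries; unaligned
# writes force a read-modify-write inside the stack.
BLOCK = 4096  # 4 KiB

def is_4k_aligned(offset, length, block=BLOCK):
    return offset % block == 0 and length % block == 0

# Hypothetical examples:
print(is_4k_aligned(6144, 8192))   # False: starts mid-block -> read-modify-write
print(is_4k_aligned(8192, 8192))   # True: clean 4 KiB-aligned write
```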


Additional Information