vSAN VM snapshot stuns and quiesce timeouts caused by VMkernel Multihoming

Article ID: 437724


Updated On:

Products

VMware vSAN

Issue/Introduction

When operating a vSAN cluster (specifically during high-I/O events like backup windows), the cluster may experience a "Congestion Collapse," resulting in the following symptoms:

  • vCenter triggers alarm.HostErrorAlarm for dropped syslog messages (e.g., vmsyslog logger [IP] lost [X] log messages) due to log buffer overruns.

  • Virtual Machines experience severe latency, freezing, or snapshot stun operations lasting significantly longer than expected (e.g., > 20 seconds).

  • Backup jobs fail or frequently report "quiesce timeout" errors.

  • vSAN Performance charts show high Reliable Datagram Transport (RDT) network latency.

  • Reviewing the vmkernel.log reveals thousands of vSAN IO throttling and congestion events, specifically logging strings such as:

    • WARNING: vSAN: PLOGElevBaseHandler: ... Throttled

    • WARNING: vSAN: PLOGZeroElevBaseHandler... Throttled

    • LSOM: LSOMGetSSDCongestion:1242: Throttled: Stuck elevator detection triggered

  • Reviewing network statistics via nicinfo.sh or esxcli network nic stats get reveals large counts of Receive missed errors (hardware buffer overflows) or sw csum error rx (software checksum errors).
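The NIC-level symptoms in the last bullet can be spotted by pulling the counters for each uplink and flagging any non-zero error fields. The sketch below is a minimal filter assuming output shaped like `Counter name: value` lines (the sample numbers are hypothetical); on a live host you would pipe the real `esxcli network nic stats get -n vmnicX` output into the same function.

```shell
#!/bin/sh
# Sketch: flag non-zero error counters in NIC statistics output.
# On a live ESXi host you would feed it real data, e.g.:
#   esxcli network nic stats get -n vmnic0 | check_nic_stats
check_nic_stats() {
  awk -F': *' '/Receive missed errors|sw csum error rx/ {
    gsub(/^ +/, "", $1)                     # trim leading indentation
    if ($2 + 0 > 0) print "ALERT: " $1 " = " $2
  }'
}

# Hypothetical sample output (counter names taken from the symptoms above):
check_nic_stats <<'EOF'
   Packets received: 1000000
   Receive missed errors: 48213
   sw csum error rx: 77
EOF
```

Any ALERT line points at hardware buffer overflows or checksum corruption on that uplink, which is worth investigating before touching any vSAN settings.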

Environment

VMware vSAN 8.x
Dell PowerEdge R750
HPE ProLiant DX380

Cause

This issue is caused by a "Congestion Collapse" event within the vSAN cluster, typically driven by a combination of architectural misconfigurations and physical layer saturation:

  1. VMkernel Multihoming (Primary Cause): The ESXi hosts have multiple VMkernel adapters (e.g., Management, vMotion, and vSAN) configured in the same IP subnet on the default TCP/IP stack. Per VMware routing behavior, ESXi routes all outbound traffic for that subnet through the lowest-numbered interface (typically vmk0). This forces massive vSAN and vMotion payloads to bypass their dedicated uplinks, resulting in severe asymmetric routing, physical switch MAC address flapping, and heavy packet loss. For a detailed explanation of why multihoming is unsupported and causes these routing failures, see Broadcom KB 318546: Multihoming on ESXi.

  2. Physical Layer Degradation: Faulty SFP+ modules, DAC cables, or degraded switch ports cause physical packet corruption (identified by sw csum error rx), exacerbating TCP retransmissions.

  3. vSAN Capacity Saturation: The vSAN capacity tier is operating above 80% utilization. High capacity utilization inherently slows down the destaging process from the Cache tier.

  4. Software Throttling: Because the network is dropping packets (due to multihoming and faulty cables), TCP retransmissions spike. This causes the vSAN DOM and Compression software threads to stall while waiting for network acknowledgments. Combined with the >80% capacity saturation, vSAN aggressively throttles incoming VM I/O to protect the cache tier from overflowing, directly causing the VM snapshot stuns and quiesce timeouts.
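Cause 1 can be confirmed by listing the VMkernel adapters and checking whether more than one shares a subnet. The sketch below runs on hypothetical sample data shaped as interface/IP/netmask columns (on a live host this information would come from `esxcli network ip interface ipv4 get`), and for simplicity assumes /24 netmasks.

```shell
#!/bin/sh
# Sketch: flag multiple VMkernel adapters sharing one IPv4 subnet.
# Input columns: <interface> <IPv4 address> <netmask>.
detect_multihoming() {
  awk '{
    split($2, o, ".")
    key = o[1] "." o[2] "." o[3]      # subnet key; assumes /24 masks
    count[key]++
    ifaces[key] = ifaces[key] " " $1
  }
  END {
    for (k in count)
      if (count[k] > 1) print "MULTIHOMED " k ".0/24:" ifaces[k]
  }'
}

# Hypothetical sample: vmk0 (Management) and vmk2 (vSAN) collide.
detect_multihoming <<'EOF'
vmk0 192.168.10.11 255.255.255.0
vmk1 10.10.30.11 255.255.255.0
vmk2 192.168.10.12 255.255.255.0
EOF
```

A MULTIHOMED line listing a vSAN or vMotion vmk alongside vmk0 matches the failure pattern described above: all of that subnet's outbound traffic leaves via the lowest-numbered interface.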

Resolution

To resolve this issue, the underlying network routing and saturation bottlenecks must be addressed. The Permanent Fixes below remove the root cause; the Workarounds only reduce impact until those fixes can be applied.

Permanent Fixes:

  1. Redesign VMkernel Networking (Isolate Subnets): Separate vSAN, vMotion, and Management traffic into distinct, non-overlapping IP subnets backed by dedicated VLANs on the physical switch. This ensures ESXi routes each traffic type symmetrically out of its designated physical uplink.

  2. Physical Layer Remediation: Identify any physical NICs recording active checksum errors (sw csum error rx or PHY symbol errors). Inspect and replace the associated SFP+ modules, DAC cables, or fiber optics connecting the host to the Top-of-Rack (ToR) switch.

  3. Capacity Reclamation: Keep vSAN capacity-tier utilization comfortably below the 80% threshold. Delete orphaned snapshots, powered-off VMs, or other unneeded files to relieve destaging pressure.
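Permanent Fix 1 can be sketched as a short runbook. The subnet (172.16.20.0/24) and adapter (vmk2) below are placeholder values, not part of this article; adjust them to your own design. The script defaults to a dry run that only prints the commands, since esxcli exists only on an ESXi host.

```shell
#!/bin/sh
# Hedged runbook sketch for isolating the vSAN VMkernel adapter.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

# Move the vSAN VMkernel adapter into its own dedicated subnet (example values):
run esxcli network ip interface ipv4 set -i vmk2 -I 172.16.20.11 -N 255.255.255.0 -t static

# Verify which VMkernel adapter carries vSAN traffic afterwards:
run esxcli vsan network list
```

After re-addressing the adapter (and moving the backing VLAN on the switch side), `esxcli network ip route ipv4 list` should show each subnet leaving its own interface rather than everything exiting vmk0.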

Workarounds:

  1. Workload Management: Until the network routing and physical-layer issues are resolved, stagger heavy I/O workloads such as concurrent backup jobs to reduce the peak GiB/s throughput pushed through the physical NIC buffers.

  2. Syslog Mitigation: To prevent vCenter alarm fatigue from the flood of vSAN throttling logs, apply a temporary log filter on the ESXi hosts to suppress the excessive PLOG strings:

    esxcli system syslog config logfilter add --filter="10|vmkernel|PLOG.*ElevBaseHandler.*Throttled"
    esxcli system syslog config logfilter set --log-filtering-enabled=true
    esxcli system syslog reload

    (Note: This only suppresses the log output to protect the syslog daemon buffer; it does not resolve the underlying snapshot stun issue.)