vSAN ESA Cluster High Latency and vCenter Management Unresponsiveness

Article ID: 439608

Products

VMware vSAN

Issue/Introduction

You encounter the following issues in your environment:

The vCenter Server Web Interface is not responsive.
Persistent, cluster-wide write latency degradation without active resync or heavy workload.
VMs (Virtual Machines) are unresponsive or report high memory and CPU utilization.
A host fails to enter maintenance mode due to ongoing resync operations.
NVMe devices show errors such as "this disk backup device has failed." but are not removed from the disk pool.

Environment

vSAN ESA 8.x

Cause

These symptoms occur due to a hardware failure on an NVMe disk (e.g., ####.NVMe____#############). The disk experiences LSOM operation latencies exceeding 40 seconds and medium/checksum errors.

The vSAN Dying Disk Handling (DDH) mechanism does not automatically evacuate the disk because the IOPS remain below the threshold (<100) required for endemic failure detection to trigger. This causes the storage stack to "hang," leading to management service unresponsiveness.

Logs from the affected host may show deteriorated performance and SMART data indicators:

Latency Warning: WARNING: StorageDeviceIO: 201: Device ####.NVMe____############# performance has deteriorated. I/O latency increased from average value of 174 microseconds to 626548 microseconds.
SMART Errors: availableSpare drops significantly (e.g., to 45 with a threshold of 10) and mediaIntegrityErrors are present (e.g., 0x351e).
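
The indicators above can be checked programmatically when reviewing host logs. The following is a minimal sketch in Python, assuming the warning text matches the sample quoted above; the regular expression and the `parse_latency_warning` helper are illustrative names and not part of any VMware tooling. It extracts the baseline and current latency from the warning and converts the example `mediaIntegrityErrors` value from hexadecimal.

```python
import re

# Pattern modeled on the sample warning quoted in this article (assumption:
# other builds may word the message differently).
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<baseline>\d+) microseconds "
    r"to (?P<current>\d+) microseconds"
)


def parse_latency_warning(line):
    """Return device name, latencies in microseconds, and their ratio, or None if no match."""
    m = LATENCY_RE.search(line)
    if m is None:
        return None
    baseline = int(m.group("baseline"))
    current = int(m.group("current"))
    return {
        "device": m.group("device"),
        "baseline_us": baseline,
        "current_us": current,
        # Guard against a zero baseline to avoid division by zero.
        "ratio": current / baseline if baseline else float("inf"),
    }


sample = (
    "WARNING: StorageDeviceIO: 201: Device ####.NVMe____############# "
    "performance has deteriorated. I/O latency increased from average value "
    "of 174 microseconds to 626548 microseconds."
)

info = parse_latency_warning(sample)
print(info)

# The example mediaIntegrityErrors value is hexadecimal; convert it to decimal.
media_errors = int("0x351e", 16)
print(media_errors)
```

In the sample line, the latency rises from 174 microseconds to 626548 microseconds, an increase of more than 3600 times the baseline.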

Resolution

To resolve this issue, you must replace the faulty hardware and adjust configuration values to trigger faster timeouts.

Identify and replace the faulty NVMe disk through your hardware vendor (e.g., HPE).

Proactively apply advanced configuration changes to all hosts in the cluster to lower the maximum I/O timeout and retry values. This ensures vSAN triggers a timeout sooner when a disk underperforms.
Note: These lower values are aligned with improvements introduced in later vSAN releases (9.1) to handle similar ESA disk failure scenarios.

- Set /LSOM/diskIoRetryFactor = 1
- Set /LSOM/diskIoTimeout = 20000
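
This article does not state how to apply the two settings. One possible approach, shown here as a hedged sketch, is the `esxcli system settings advanced set` command run on each host. The Python helper below only builds the command strings for review; the function names are hypothetical, and the choice of `esxcli` is an assumption rather than the documented procedure.

```python
# Values taken from this article; the option paths are the advanced settings listed above.
LSOM_SETTINGS = {
    "/LSOM/diskIoRetryFactor": 1,
    "/LSOM/diskIoTimeout": 20000,
}


def build_set_commands(settings):
    """Return one 'esxcli system settings advanced set' command per option."""
    return [
        f"esxcli system settings advanced set -o {option} -i {value}"
        for option, value in settings.items()
    ]


def build_verify_commands(settings):
    """Return one 'esxcli system settings advanced list' command per option."""
    return [
        f"esxcli system settings advanced list -o {option}"
        for option in settings
    ]


set_commands = build_set_commands(LSOM_SETTINGS)
verify_commands = build_verify_commands(LSOM_SETTINGS)

# Print the commands so they can be reviewed before being run on each ESXi host.
for cmd in set_commands + verify_commands:
    print(cmd)
```

Run the generated commands on every host in the cluster, then use the `list` commands to confirm the configured values.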

Additional Information

Although the symptoms are seen on the ESXi host, the impact propagates to the vCenter Server Appliance (VCSA) if it resides on the affected datastore.
