You encounter the following issues in your environment:
vSAN ESA 8.x
These symptoms occur due to a hardware failure on an NVMe disk (e.g., ####.NVMe____#############). The disk experiences LSOM operation latencies exceeding 40 seconds and medium/checksum errors.
The vSAN Dying Disk Handling (DDH) mechanism does not automatically evacuate the disk because the IOPS remain below the threshold (<100) required for endemic failure detection to trigger. This causes the storage stack to "hang," leading to management service unresponsiveness.
Logs from the affected host may show deteriorated performance and SMART data indicators:
WARNING: StorageDeviceIO: 201: Device ####.NVMe____############# performance has deteriorated. I/O latency increased from average value of 174 microseconds to 626548 microseconds.
In the SMART data, availableSpare drops significantly (e.g., to 45 with a threshold of 10) and mediaIntegrityErrors are present (e.g., 0x351e).
To resolve this issue, you must replace the faulty hardware and adjust configuration values to trigger faster timeouts:
Identify and replace the faulty NVMe disk through your hardware vendor (e.g., HPE).
Proactively apply the following advanced configuration changes to all hosts in the cluster to lower the maximum I/O timeout and retry values. This ensures vSAN triggers a timeout sooner when a disk underperforms.
Note: These lower values are aligned with improvements introduced in later vSAN releases (9.1) to handle similar ESA disk failure scenarios.
/LSOM/diskIoRetryFactor = 1
/LSOM/diskIoTimeout = 20000
While the symptoms are seen on the ESXi host, the impact propagates to the vCenter Server Appliance (VCSA) if it resides on the affected datastore.
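As a minimal sketch of how the two advanced settings above could be applied, the following Python helper prints the esxcfg-advcfg commands to run in the ESXi Shell of each host. It assumes the standard esxcfg-advcfg options (-g to read a value, -s to set it). The helper name build_advcfg_commands and the LSOM_SETTINGS dictionary are hypothetical names introduced for illustration and are not part of any VMware tooling. The sketch only generates the command text and does not change any host.

```python
# Option values are taken from this article; the helper itself is illustrative.
LSOM_SETTINGS = {
    "/LSOM/diskIoRetryFactor": 1,
    "/LSOM/diskIoTimeout": 20000,
}


def build_advcfg_commands(settings):
    """Return esxcfg-advcfg commands that show, then set, each option."""
    commands = []
    for option, value in settings.items():
        # Show the current value first so it can be recorded before the change.
        commands.append(f"esxcfg-advcfg -g {option}")
        # Apply the new value.
        commands.append(f"esxcfg-advcfg -s {value} {option}")
    return commands


if __name__ == "__main__":
    # Print the commands for review; run them on every host in the cluster.
    for command in build_advcfg_commands(LSOM_SETTINGS):
        print(command)
```

Recording the output of the -g commands before applying the change makes it possible to restore the previous values later if needed.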