vSAN ESA: Physical Disk 'Operation' Alarm and Fluctuating Skyline Health

Article ID: 436122

Products

VMware vSAN

Issue/Introduction

In a vSAN Express Storage Architecture (ESA) environment, vCenter may trigger a physical disk "Operation" alarm. The alarm is typically accompanied by a fluctuating Skyline Health score, because the system repeatedly attempts, and fails, to repair errors on an NVMe device.

Symptoms

Skyline Health reports "Physical disk operation health" in red.
The number of detected disks in vSAN Disk Management is lower than the expected physical count.
Running vdq -Hi on the host returns an I/O timeout error: VsanUtil::AIO_ReadWriteDeviceWithTimeOut: Device: /vmfs/devices/disks/[ID], read 0 out of 4096 errno 2
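
When vdq output has been saved to a file or pasted into a ticket, the timeout line above can be picked out mechanically. The following is a minimal Python sketch, not a supported tool. It assumes the error line has exactly the shape quoted above and that the device ID contains no commas or whitespace. The function name find_failed_reads and the dictionary keys are hypothetical names chosen for illustration.

```python
import re
from typing import Dict, List, Union

# Matches the error line quoted in the Symptoms section.
# Assumption: the device ID runs up to the next comma or whitespace.
ERROR_RE = re.compile(
    r"AIO_ReadWriteDeviceWithTimeOut: Device: /vmfs/devices/disks/(?P<device>[^\s,]+), "
    r"read (?P<got>\d+) out of (?P<want>\d+) errno (?P<errno>\d+)"
)


def find_failed_reads(vdq_output: str) -> List[Dict[str, Union[str, int]]]:
    """Return one record per timeout line found in captured vdq output."""
    hits = []
    for match in ERROR_RE.finditer(vdq_output):
        hits.append({
            "device": match.group("device"),          # ID to pass to -d later
            "bytes_read": int(match.group("got")),
            "bytes_expected": int(match.group("want")),
            "errno": int(match.group("errno")),
        })
    return hits
```

Each returned record carries the device ID, which is the value the commands in the Resolution section expect after -d.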

Environment

VMware vSAN 8.x
vSAN Express Storage Architecture (ESA)

Cause

Hardware degradation of an NVMe device results in unrecoverable metadata read timeouts. Because ESA uses a single-tier storage pool, persistent I/O failures on one device cause the driver to take the controller offline to prevent cluster-wide storage stalls.

Resolution

Locate the Device: Identify the physical Box and Bay of the failing NVMe disk by running: localcli storage core device physical get -d [Device_ID]
Check SMART Data: Review the drive health parameters by running: localcli storage core device smart get -d [Device_ID]
Evacuate Data: If the host is still responsive, evacuate vSAN data from the impacted host before the disk is replaced.
Vendor Engagement: Contact your hardware provider (e.g., HPE, Dell) for a physical disk replacement. Provide the specific Box/Bay location and the vdq error output.
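
The two diagnostic commands in the steps above can be assembled for a given device so that their output is collected consistently for the vendor case. The following is a minimal Python sketch under stated assumptions: the helper name triage_commands is hypothetical, the example device ID is a placeholder, and the script only prints the commands rather than running them.

```python
import shlex
from typing import List


def triage_commands(device_id: str) -> List[List[str]]:
    """Build the location and SMART commands from the Resolution steps."""
    if not device_id.strip():
        raise ValueError("a device ID is required")
    return [
        # Physical location (Box and Bay) of the device.
        ["localcli", "storage", "core", "device", "physical", "get", "-d", device_id],
        # SMART health parameters of the device.
        ["localcli", "storage", "core", "device", "smart", "get", "-d", device_id],
    ]


if __name__ == "__main__":
    # Placeholder ID for illustration only; substitute the real device ID.
    for cmd in triage_commands("t10.EXAMPLE_DEVICE_ID"):
        print(shlex.join(cmd))
```

Keeping each command as an argument list avoids quoting problems if it is later passed to a process launcher, and shlex.join renders a copy-paste-safe line for the ESXi shell.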
