vSAN ESA: Physical Disk 'Operation' Alarm and Fluctuating Skyline Health

Article ID: 436122

Products

VMware vSAN

Issue/Introduction

In a vSAN Express Storage Architecture (ESA) environment, vCenter may trigger a physical disk "Operation" alarm. The alarm is typically accompanied by a fluctuating Skyline Health score, because the system repeatedly attempts, and fails, to repair transient errors on an NVMe device.

Symptoms

  • Skyline Health reports "Physical disk operation health" in red.
  • The number of detected disks in vSAN Disk Management is lower than the expected physical count.
  • Running vdq -Hi on the host returns an I/O timeout error: VsanUtil::AIO_ReadWriteDeviceWithTimeOut: Device: /vmfs/devices/disks/[ID], read 0 out of 4096 errno 2
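
When vdq output has been saved from several hosts, the error line in the last symptom can be picked out programmatically. The following is a minimal Python sketch, not a VMware tool: the regular expression is modeled only on the single message quoted above, and the function name find_failed_reads and the dictionary keys are hypothetical. Other vdq builds may word the message differently.

```python
import re

# Pattern modeled on the one error line quoted in the symptoms above.
# Assumption: other vdq builds may format this message differently.
VDQ_IO_ERROR = re.compile(
    r"AIO_ReadWriteDeviceWithTimeOut: Device: (?P<device>\S+), "
    r"read (?P<got>\d+) out of (?P<want>\d+) errno (?P<errno>\d+)"
)


def find_failed_reads(vdq_output):
    """Return one dict per logged read that returned fewer bytes than requested."""
    failures = []
    for line in vdq_output.splitlines():
        match = VDQ_IO_ERROR.search(line)
        if match and int(match["got"]) < int(match["want"]):
            failures.append({
                "device": match["device"],
                "bytes_read": int(match["got"]),
                "bytes_expected": int(match["want"]),
                "errno": int(match["errno"]),
            })
    return failures


# The sample is the message from the symptom list, placeholder ID included.
sample = ("VsanUtil::AIO_ReadWriteDeviceWithTimeOut: Device: "
          "/vmfs/devices/disks/[ID], read 0 out of 4096 errno 2")
for failure in find_failed_reads(sample):
    print(failure)
```

The device path reported by each match is the identifier to carry into the resolution steps.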

Environment

  • VMware vSAN 8.x
  • vSAN Express Storage Architecture (ESA)

Cause

Hardware degradation of an NVMe device causes unrecoverable metadata read timeouts. Because ESA uses a single-tier storage pool, persistent I/O failures on one device cause the driver to take that device's controller offline to prevent cluster-wide storage stalls.

Resolution

  1. Locate the Device: Identify the physical Box and Bay of the failing NVMe disk: localcli storage core device physical get -d [Device_ID]
  2. Check SMART Data: Review the drive health parameters: localcli storage core device smart get -d [Device_ID]
  3. Evacuate Data: If possible, evacuate data from the impacted host.
  4. Vendor Engagement: Contact your hardware provider (e.g., HPE, Dell) for a physical disk replacement. Provide the specific Box/Bay location and the vdq error output.
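
Steps 1 and 2 can be collected in one pass so that the output is ready to attach to the vendor case. The sketch below is illustrative only and rests on assumptions: that a Python 3.7 or later interpreter is available where the commands run, and that the helper names triage_commands and collect_vendor_evidence (both hypothetical) suit your workflow. It uses only the two localcli commands shown in the steps above.

```python
import subprocess


def triage_commands(device_id):
    """Build the lookups from steps 1 and 2 for a single device."""
    base = ["localcli", "storage", "core", "device"]
    return {
        "location": base + ["physical", "get", "-d", device_id],
        "smart": base + ["smart", "get", "-d", device_id],
    }


def collect_vendor_evidence(device_id, run=subprocess.run):
    """Run each lookup and return its raw text output, keyed by step name.

    The `run` parameter exists so the function can be exercised off-host
    with a stand-in; by default it calls subprocess.run.
    """
    evidence = {}
    for name, argv in triage_commands(device_id).items():
        result = run(argv, capture_output=True, text=True)
        # Keep stderr when stdout is empty so a failed lookup is still recorded.
        evidence[name] = result.stdout or result.stderr
    return evidence


# Preview the commands without executing anything; "[Device_ID]" is the
# same placeholder used in the steps above.
for name, argv in triage_commands("[Device_ID]").items():
    print(name, "->", " ".join(argv))
```

On a host, calling collect_vendor_evidence with the real device identifier returns the raw command output as text, which can be saved alongside the vdq error output for the hardware provider.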