Identifying and replacing a vSAN capacity disk experiencing permanent I/O failures and extreme latency.


Article ID: 437285


Updated On:

Products

VMware vSAN 8.x, VMware vSAN 7.x

Issue/Introduction

A capacity disk in a vSAN cluster is failing, which degrades performance and puts data availability at risk. The disk remains "In-use" by vSAN but is flagged as "Failed" or "Unhealthy" because of consistent I/O errors and latency spikes.

  • Identification Need: The logical device ID (naa.ID) found in the logs must be mapped to a specific host and physical chassis slot.

  • vSphere Client Alerts: vSAN Skyline Health may report 'Disk health' or 'Physical disk' failures.

  • Performance Impact: Virtual Machines experience "stuns" or high wait times for storage I/O.

  • Log Entries: The following entries appear in the vmkernel.log of the affected host:

    WARNING: Partition: 1387: Partition table read from device naa.################ failed: I/O error

    WARNING: ScsiDeviceIO: 1779: Device naa.################ performance has deteriorated. I/O latency increased... to 3462569 microseconds.

    2026-03-18T06:37:52.527Z In(182) vmkernel: cpu10:2098230)HPP: HppScsiLogError:329: last error status from device naa.################ repeated 2 times
    2026-03-18T06:37:53.453Z Wa(180) vmkwarning: cpu21:2097956)WARNING: ScsiDeviceIO: 1779: Device naa.################ performance has deteriorated. I/O latency increased from average value of 13854 microseconds to 7788911 microseconds.
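
When a vmkernel.log is large, the latency warnings shown above can be extracted with a short script to see which device IDs are affected. The following is a minimal Python sketch, assuming the message follows the full form of the last sample line ("from average value of ... to ..."). The function name parse_latency_warning is illustrative and is not part of any VMware tool. The masked device ID is copied from the sample as-is.

```python
import re

# Matches the "performance has deteriorated" warning in the full form shown above.
LATENCY_RE = re.compile(
    r"Device (?P<device>naa\.\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg_us>\d+) microseconds "
    r"to (?P<now_us>\d+) microseconds"
)

def parse_latency_warning(line):
    """Return (device, average_us, current_us), or None if the line does not match."""
    match = LATENCY_RE.search(line)
    if match is None:
        return None
    return (
        match.group("device"),
        int(match.group("avg_us")),
        int(match.group("now_us")),
    )

# Sample line taken from the log entries above (device ID masked in the source).
sample = (
    "2026-03-18T06:37:53.453Z Wa(180) vmkwarning: cpu21:2097956)WARNING: "
    "ScsiDeviceIO: 1779: Device naa.################ performance has deteriorated. "
    "I/O latency increased from average value of 13854 microseconds "
    "to 7788911 microseconds."
)
print(parse_latency_warning(sample))
```

Running the function over every line of the log and collecting the distinct device values gives the list of naa IDs to locate in Step 1.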

Environment

vSAN Original Storage Architecture (OSA)

Cause

The physical storage device is experiencing hardware degradation. This is evidenced by a high number of failed read operations and latency spikes exceeding 3,000 ms, which cause the ESXi storage stack to time out while attempting to read the disk's partition table.
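
The log reports latency in microseconds, while the figure above is given in milliseconds. The following Python sketch shows the conversion using the two values from the log entries in this article. The 3,000 ms constant is taken from the description above and is not an ESXi setting, and the helper names are hypothetical.

```python
# Figure quoted in this article; an assumption for illustration, not an ESXi tunable.
SPIKE_THRESHOLD_MS = 3000

def us_to_ms(microseconds):
    """Convert a latency value from microseconds to milliseconds."""
    return microseconds / 1000.0

def is_extreme_latency(microseconds, threshold_ms=SPIKE_THRESHOLD_MS):
    """Return True if the latency exceeds the threshold expressed in milliseconds."""
    return us_to_ms(microseconds) > threshold_ms

# Values copied from the vmkernel.log samples above.
for observed_us in (3462569, 7788911):
    print(observed_us, "us =", us_to_ms(observed_us), "ms;",
          "extreme:", is_extreme_latency(observed_us))
```

Both logged values are well above 3,000 ms (roughly 3.5 and 7.8 seconds), while the logged average of 13854 microseconds is under 14 ms.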

Resolution

Step 1: Locate the Physical Disk

To ensure the correct drive is replaced in the physical server, use the ESXi command line to trigger the locator LED.

  1. Log in to the host via SSH.

  2. Run the following command, replacing <naa.ID> with the identifier of the failed device:

    esxcli storage core device set -d <naa.ID> --led-state locator --led-duration 100
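
If the locator command has to be issued for several devices, building the argument list in a script helps avoid typos in the device ID. The sketch below is one possible approach in Python, assuming the naa identifier consists of hexadecimal digits after the "naa." prefix. The function name build_led_command and the example device ID are hypothetical. The "off" state is included on the assumption that the LED should be switched off once the drive has been found.

```python
import re

# Assumed shape of a device identifier: "naa." followed by hexadecimal digits.
NAA_RE = re.compile(r"^naa\.[0-9a-fA-F]+$")

def build_led_command(device_id, state="locator", duration_s=100):
    """Build the esxcli argument list that sets the LED state of a device."""
    if not NAA_RE.match(device_id):
        raise ValueError("unexpected device ID format: %r" % device_id)
    if state not in ("locator", "off"):
        raise ValueError("unsupported LED state in this sketch: %r" % state)
    cmd = ["esxcli", "storage", "core", "device", "set",
           "-d", device_id, "--led-state", state]
    if state == "locator":
        # Same duration value as in the command above.
        cmd += ["--led-duration", str(duration_s)]
    return cmd

# Hypothetical device ID, used only to show the resulting command line.
print(" ".join(build_led_command("naa.5000c5000000abcd")))
```

On the ESXi host itself, the returned list could be passed to subprocess.run(cmd, check=True) to execute the command; the sketch only prints it so that it can be reviewed first.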

Step 2: Remove the Disk from vSAN Disk Management

Before physically removing the drive, you must logically remove it from the vSAN disk group.

  1. Navigate to the vSphere Client.

  2. Select the Cluster > Configure > vSAN > Disk Management.

  3. Select the host containing the failed disk.

  4. Under Disk Groups, select the affected group and locate the failed disk.

  5. Click Remove Disk.

  6. Data Migration Selection: Select No Data Migration if the disk is already failing or timing out.

    • Note: In cases of extreme latency, attempting "Full Data Migration" may hang the task or impact cluster performance further.

Step 3: Physical Replacement

  1. Once the disk is removed from the UI and the locator LED is active, physically pull the drive.

  2. Insert the replacement drive.

  3. Return to Disk Management in the vSphere Client and use the Add Disks option to claim the new drive into the existing disk group.

Additional Information