Cluster high latency and congestion due to cache tier disk degradation

Article ID: 432963

Products

VMware Telco Cloud Platform, VMware vSAN, VMware Cloud Foundation

Issue/Introduction

  • A VMware vSAN Original Storage Architecture (OSA) cluster shows severe, cluster-wide latency under Cluster -> Monitor -> vSAN -> Performance -> Disks, while Congestion and high Outstanding I/O are isolated to a specific ESXi host.

  • In Cluster -> Monitor -> vSAN -> Skyline Health, you see the vSAN object health alert:
    • Category: Data
    • Impact area: Compliance
    • Description: Provides a cluster wide overview by summarizing all objects in the cluster, grouping them in fine grained categories of object health. Notice this check will show as red if all hosts are disconnected or fail to query object health.
    • Risk if no action taken: There is object compliance issue which may either impact the performance or the capacity consumption.
  • You may also see unexpected resync activity in Cluster -> Monitor -> vSAN -> Resyncing Objects.
  • You may see the following messages in the /var/log/vmkernel.log file:
    • Device latency messages:
         WARNING: ScsiDeviceIO: 1513: Device naa.################ performance has deteriorated. I/O latency increased from average value of 1206 microseconds to 396174 microseconds.
    • Failing SCSI commands on the specific device:
         WARNING: HPP: HppScsiThrottleLogForDevice:585: Cmd 0x2a (0x45dd530655c0, 0) to dev "naa.################" on path vmhba1:C0:T3:L0 Failed:
         WARNING: HPP: HppScsiThrottleLogForDevice:593: Error status H:0x0 D:0x2 P:0x0 Valid sense data: 0xXX 0xXX 0xXX. hppAction=1
  • You may see the following messages in the /var/run/log/vsandevicemonitord.log file:
    • vSAN Dying Disk Handling (DDH) repair messages: 

2026-04-02T10:20:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ state is DISK_UNDER_REPAIR
2026-04-02T10:20:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ state is DISKGROUP_UNDER_REPAIR
2026-04-02T10:20:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ is DISKGROUP_UNDER_REPAIR
2026-04-02T10:20:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ is DISKGROUP_UNDER_REPAIR
2026-04-02T10:20:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ state is DISKGROUP_UNDER_REPAIR
2026-04-02T10:20:53Z In(14) vsandevicemonitord[2099815]: [577809138368]: Unmount succeeded on VSAN device naa.################.
2026-04-02T10:20:53Z In(14) vsandevicemonitord[2099815]: [577809138368]: Device naa.################ was already unmounted.
2026-04-02T10:32:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Mount succeeded on VSAN device naa.################.
2026-04-02T10:32:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: Repair successful for device ########-####-####-####-############
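
When vmkernel.log is large, the ScsiDeviceIO latency warning shown above can be picked out with a small script. The following is a minimal sketch, not a supported tool: it assumes the message format is exactly as quoted in this article, and the function name `find_degraded_devices` and the `min_ratio` threshold are illustrative choices rather than vSAN-defined values.

```python
import re

# Matches the ScsiDeviceIO latency warning format quoted in this article.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def find_degraded_devices(lines, min_ratio=10.0):
    """Return {device: (average_us, worst_us)} for each device whose reported
    latency is at least min_ratio times its average.

    min_ratio is an arbitrary illustrative threshold, not a vSAN setting.
    """
    worst = {}
    for line in lines:
        match = LATENCY_RE.search(line)
        if not match:
            continue
        avg = int(match.group("avg"))
        cur = int(match.group("cur"))
        if avg > 0 and cur / avg >= min_ratio:
            previous = worst.get(match.group("device"))
            # Keep only the worst latency seen for each device.
            if previous is None or cur > previous[1]:
                worst[match.group("device")] = (avg, cur)
    return worst


# Sample line taken from this article (device ID is masked in the source).
sample = [
    "WARNING: ScsiDeviceIO: 1513: Device naa.################ performance has "
    "deteriorated. I/O latency increased from average value of 1206 "
    "microseconds to 396174 microseconds.",
]
print(find_degraded_devices(sample))
```

On a real host, the lines would come from a copy of vmkernel.log, for example `find_degraded_devices(open("vmkernel.log"))`. The device ID reported here is the one to carry into the Resolution steps.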

Environment

  • vSAN: 7.x, 8.x (OSA only)
  • VMware Cloud Foundation (VCF): 4.x
  • VMware Telco Cloud Platform (TCP): 3.x

Cause

  • This issue occurs when a physical flash caching device within a vSAN disk group suffers a hardware failure or severe performance degradation. In OSA, all writes to a disk group pass through its cache device, so a slow cache device becomes an I/O bottleneck for every component on that disk group and shows up as elevated latency across the cluster.
  • Any resync activity is caused by vSAN Dying Disk Handling (DDH) taking the disk or disk group out of service.
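
To relate resync activity to DDH, it can help to measure how long the device was out of service between the "Unmount succeeded" and "Mount succeeded" entries in vsandevicemonitord.log. The sketch below assumes the timestamp and message formats shown in the Issue/Introduction section and pairs events by device ID. The function name `out_of_service_windows` is hypothetical, and real logs may require matching on the disk group rather than a single device.

```python
import re
from datetime import datetime, timezone

# Matches the unmount/mount lines in the format quoted in this article.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z .*?"
    r"(?P<event>Unmount|Mount) succeeded on VSAN device (?P<device>\S+?)\.?$"
)


def out_of_service_windows(lines):
    """Pair each unmount with the next mount of the same device and
    return {device: seconds_out_of_service}."""
    unmounted = {}
    windows = {}
    for line in lines:
        match = EVENT_RE.search(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        ts = ts.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
        device = match.group("device")
        if match.group("event") == "Unmount":
            # Remember the first unmount until a matching mount appears.
            unmounted.setdefault(device, ts)
        elif device in unmounted:
            windows[device] = (ts - unmounted.pop(device)).total_seconds()
    return windows


# Sample lines taken from this article (device ID is masked in the source).
sample = [
    "2026-04-02T10:20:53Z In(14) vsandevicemonitord[2099815]: [577809138368]: "
    "Unmount succeeded on VSAN device naa.################.",
    "2026-04-02T10:32:43Z In(14) vsandevicemonitord[2099815]: [577809138368]: "
    "Mount succeeded on VSAN device naa.################.",
]
print(out_of_service_windows(sample))  # {'naa.################': 710.0}
```

In the sample from this article the device was unmounted for 710 seconds (about 12 minutes), which is the window in which components on that disk group were unavailable and a resync could be triggered.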

 

Resolution

1. Identify the affected ESXi host and note the exact device identifier (`naa.################`) from the `vmkernel.log` ScsiDeviceIO warning.
2. In the vSphere Client, navigate to the vSAN cluster -> Configure -> vSAN -> Disk Management.
3. Confirm disk membership and role:
    • Before removing any hardware, verify that the naa.################ ID identified in the logs is claimed by vSAN, and determine its role (cache or capacity). This prevents the accidental removal of non-vSAN disks such as boot devices or local VMFS volumes.
    • Run the following command on the affected ESXi host to check whether the ID belongs to a cache or capacity disk:

# vdq -iH

Mappings:
   DiskMapping[0]:  
           SSD:  naa.################
            MD:  naa.################
            MD:  naa.################
            MD:  naa.################

   DiskMapping[2]:
           SSD:  naa.################
            MD:  naa.################
            MD:  naa.################
            MD:  naa.################

    • In-Group: If the disk is listed under SSD (cache) or MD (capacity), it is a vSAN disk.
    • Not Listed: If the ID does not appear, the disk is either not claimed by vSAN (e.g., a boot device or local VMFS volume) or has degraded so far that the host no longer recognizes it as a vSAN disk.
    • Caution: If the device is not listed in the output above, do not proceed with the vSAN disk removal steps. Investigate the device as a potential local boot or data volume.

4. Remove the affected disk group, evacuating its data first if cluster health and available capacity allow.

5. Physically replace the degraded flash caching device on the ESXi host.
6. Recreate the vSAN disk group using the newly installed flash caching device.
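
The role check in step 3 can also be scripted against saved `vdq -iH` output. This is a minimal sketch that assumes the output layout shown in step 3 (a `DiskMapping[n]:` header followed by `SSD:` and `MD:` lines). The function name `disk_roles` and the device IDs in the sample are made up for illustration.

```python
import re


def disk_roles(vdq_output):
    """Map each device ID in `vdq -iH` output to (role, disk_mapping_index).

    SSD lines are cache devices and MD lines are capacity devices,
    as described in step 3.
    """
    roles = {}
    group = None
    for raw in vdq_output.splitlines():
        line = raw.strip()
        header = re.match(r"DiskMapping\[(\d+)\]:", line)
        if header:
            group = int(header.group(1))
            continue
        disk = re.match(r"(SSD|MD):\s+(\S+)", line)
        if disk and group is not None:
            role = "cache" if disk.group(1) == "SSD" else "capacity"
            roles[disk.group(2)] = (role, group)
    return roles


# Hypothetical device IDs, used only to illustrate the layout.
SAMPLE = """\
Mappings:
   DiskMapping[0]:
           SSD:  naa.5000000000000001
            MD:  naa.5000000000000002
            MD:  naa.5000000000000003
"""

roles = disk_roles(SAMPLE)
suspect = "naa.5000000000000001"
# None means the ID is not claimed by vSAN: stop and investigate (see Caution).
print(suspect, "->", roles.get(suspect))
```

A result of `None` corresponds to the "Not Listed" case in step 3, where the vSAN disk removal steps should not be followed.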
 

Additional Information