Cluster high latency and congestion due to cache tier disk degradation

Article ID: 432963

Products

VMware vSphere ESXi, VMware Cloud Foundation, VMware Telco Cloud Platform

Issue/Introduction

  • VMware vSAN environments may experience severe, cluster-wide high latency and I/O congestion. Troubleshooting reveals high Outstanding I/O isolated to a specific ESXi host.
  • Reviewing the /var/log/vmkernel.log on the affected host shows the following recurring ScsiDeviceIO warning:
    WARNING: ScsiDeviceIO: 1513: Device naa.############## performance has deteriorated. I/O latency increased from average value of 1206 microseconds to 396174 microseconds.
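
To make the scale of the degradation easier to judge, the warning can be parsed programmatically. The following is a minimal Python sketch, not an official tool: the regular expression is written against the exact message shown above, and the names `LatencyWarning`, `find_latency_warnings`, and `scan_log` are hypothetical helpers introduced for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class LatencyWarning(NamedTuple):
    device: str       # device identifier as logged, e.g. naa.<id>
    baseline_us: int  # average latency before the degradation, in microseconds
    current_us: int   # latency reported in the warning, in microseconds

    @property
    def ratio(self) -> float:
        """How many times higher the current latency is than the baseline."""
        if self.baseline_us == 0:
            return float("inf")
        return self.current_us / self.baseline_us


# Pattern modelled on the ScsiDeviceIO warning quoted in this article.
PATTERN = re.compile(
    r"ScsiDeviceIO: \d+: Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<baseline>\d+) microseconds "
    r"to (?P<current>\d+) microseconds"
)


def find_latency_warnings(lines: Iterable[str]) -> Iterator[LatencyWarning]:
    """Yield one LatencyWarning for every matching log line."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield LatencyWarning(
                match["device"], int(match["baseline"]), int(match["current"])
            )


def scan_log(path: str = "/var/log/vmkernel.log") -> list:
    """Read a vmkernel log file and return all latency warnings found in it."""
    with open(path, errors="replace") as handle:
        return list(find_latency_warnings(handle))
```

For the sample message above, the parsed values are 1206 and 396174 microseconds, which is an increase of more than 300 times over the baseline. A device that appears repeatedly with a ratio of this order is a strong candidate for the correlation steps in the Resolution section.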

Environment

ESXi: 7.0u3

VCF: 4.x

Telco Cloud Platform (TCP): 3.x

Cause

This issue occurs when a physical flash caching device within a vSAN disk group suffers a hardware failure or severe performance degradation, creating an I/O bottleneck that manifests as elevated latency across the cluster.

Resolution

  1. Identify the affected ESXi host and note the exact device identifier (`naa.###########`) from the `vmkernel.log` ScsiDeviceIO warning.
  2. Open the vSphere Client and navigate to the vSAN cluster > Configure > vSAN > Disk Management.
  3. Correlate the device ID to the specific disk group and cache tier disk.
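  4. Remove the affected disk group. When prompted for a data migration mode, choose Full data migration or Ensure accessibility if cluster health and available capacity allow; otherwise choose No data migration.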
  5. Physically replace the degraded flash caching device on the ESXi host.
  6. Recreate the vSAN disk group using the newly installed flash caching device.
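
As an alternative to the UI for step 3, the device can be correlated to its disk group from the host's command line by inspecting the output of `esxcli vsan storage list`. The Python sketch below parses that output. It assumes the output consists of an unindented device line followed by indented `Key: value` lines, and that the fields `Device`, `Is SSD`, `Is Capacity Tier`, and `VSAN Disk Group UUID` are present; verify the field names on your build before relying on it. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def parse_vsan_storage_list(text: str) -> List[Dict[str, str]]:
    """Split `esxcli vsan storage list` style output into one dict per device.

    Assumption: each record starts with an unindented line and is followed
    by indented "Key: value" lines.
    """
    devices: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # An unindented line starts a new device record.
            current = {"_header": raw.strip()}
            devices.append(current)
        elif current is not None and ":" in raw:
            key, _, value = raw.strip().partition(":")
            current[key.strip()] = value.strip()
    return devices


def locate_device(text: str, device_id: str) -> Optional[Dict[str, object]]:
    """Return the disk group UUID and tier of device_id, or None if not listed."""
    for dev in parse_vsan_storage_list(text):
        if dev.get("Device") == device_id:
            return {
                "device": device_id,
                "disk_group_uuid": dev.get("VSAN Disk Group UUID"),
                # Assumption: a cache device is flash and is not marked
                # as a capacity-tier device.
                "is_cache_tier": dev.get("Is SSD") == "true"
                and dev.get("Is Capacity Tier") == "false",
            }
    return None
```

Capture the command output on the affected host, pass it as `text`, and pass the identifier from the ScsiDeviceIO warning as `device_id`. If `is_cache_tier` is true, the whole disk group identified by `disk_group_uuid` is affected, which is why steps 4 to 6 remove and recreate the disk group rather than a single disk.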

Additional Information

Broadcom TechDocs:
  • Replace a Flash Caching Device in a vSAN Cluster
  • Replace a Flash Caching Device on a Host