vSAN Cluster Performance Impact Due to Cache-Tier SSD Congestion During VM SQL Server Restore Operations

Article ID: 402767

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms

During a SQL Server backup restore operation for virtual machine TEST, the vSAN cluster experiences degraded performance, characterized by:

Elevated vSAN cluster-wide read/write latency.
Increased vSAN backend latency observed during restore activity.
SSD congestion values exceeding threshold on host ESXi1.
- UUID: 5227bfcc-####-####-####-############
- ssdCongestion: 102
- ssdCongestionLocalMax: 102
- Congestion value reached: 125

Example output from esxcli vsan debug disk list:

UUID: 5227bfcc-####-####-####-############
   Name: naa.500############
   Owner: ESXi1
   Version: 15
   Disk Group: 5227bfcc-####-####-####-############
   Disk Tier: Cache
   SSD: true
   In Cmmds: true
   In Vsi: true
   Fault Domain: N/A
   Model: MZILG800HCHQAD3
   Encryption: false
   Compression: true
   Deduplication: true
   Dedup Ratio: N/A
   Overall Health: green
   Metadata Health: green
   Operational Health: green
   Congestion Health:
         State: green
         Congestion Value: 112
         Congestion Area: ssd
         All Congestion Fields:
         SSD: 112
         Log: 0
         IOPS: 0
         Slab: 0
         Memory: 0
   Space Health:

This congestion correlates with the backup restore activity on VM TEST, which runs on host ESXi1 in the cluster.
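To check many disks at once, the command output shown above can be scanned for cache devices whose congestion value exceeds the >100 level this article treats as severe. The following is a minimal sketch, not a supported tool: it assumes the output has been saved as text and that every disk record starts with a "UUID:" line, as in the sample. The names parse_disks, congested_disks and SAMPLE are hypothetical, and SAMPLE is a trimmed copy of the output above.

```python
# Sketch: flag disks whose "Congestion Value" exceeds a threshold in saved
# `esxcli vsan debug disk list` output. Function names are illustrative only.

CONGESTION_THRESHOLD = 100  # assumption: taken from the ">100" guidance in this article
WANTED = {"Owner", "Disk Tier", "Congestion Value", "Congestion Area"}

# Trimmed copy of the sample output in this article.
SAMPLE = """\
UUID: 5227bfcc-####-####-####-############
   Name: naa.500############
   Owner: ESXi1
   Disk Group: 5227bfcc-####-####-####-############
   Disk Tier: Cache
   SSD: true
   Congestion Health:
         State: green
         Congestion Value: 112
         Congestion Area: ssd
"""


def parse_disks(text):
    """Return one dict per disk record; a record starts at each 'UUID:' line."""
    disks = []
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "UUID":
            current = {"UUID": value}
            disks.append(current)
        elif current is not None and key in WANTED:
            current[key] = value
    return disks


def congested_disks(disks, threshold=CONGESTION_THRESHOLD):
    """Keep only disks whose numeric congestion value is above the threshold."""
    flagged = []
    for disk in disks:
        value = disk.get("Congestion Value", "")
        if value.isdigit() and int(value) > threshold:
            flagged.append(disk)
    return flagged


for disk in congested_disks(parse_disks(SAMPLE)):
    print(disk["Owner"], disk["Disk Tier"], disk["Congestion Area"], disk["Congestion Value"])
```

The sketch reads only the summary "Congestion Value" field, because the key "SSD" appears twice in each record (once as a disk attribute, once as a congestion field) and would be ambiguous to parse by name alone.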

Environment

VMware vSAN 7.x

Cause

The performance issue was caused by excessive I/O load during the backup restoration operation on VM TEST.

Specifically:

Multiple outstanding I/O operations (OIO) built up during the restore.
The cache-tier SSD on ESXi1 became congested, impacting the ability to service new I/O.
This led to:
- vSAN backend congestion.
- Cluster-wide latency propagation.
Contributing factors:
- Storage Adapter (HBA) and NIC firmware/driver versions on ESXi1 were not fully aligned with the Broadcom Hardware Compatibility List (HCL), potentially resulting in suboptimal performance under load.

Resolution

Workaround:

Place host ESXi1 into maintenance mode using the Ensure Accessibility option.
Reboot the host.

Result:

SSD congestion is expected to clear following the host reboot.
vSAN backend latency is expected to normalize across the cluster after the reboot.

Resolution:

Update HBA and NIC firmware/drivers on all ESXi hosts in the vSAN cluster to versions supported on the Broadcom HCL.
Ensure driver and firmware consistency across all hosts in the vSAN cluster.
Continue monitoring for:
- Latency spikes.
- Congestion patterns during I/O-intensive operations (e.g., backups, restores).
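The driver and firmware consistency step above can be approximated with a small comparison once the versions have been collected from each host. This is a sketch under stated assumptions: how the inventory is gathered is out of scope, the function name find_version_drift is hypothetical, and the hosts ESXi2 and ESXi3, the component names, and all version strings below are made-up placeholders. It reports only differences between hosts; it does not check whether a version is on the Broadcom HCL.

```python
# Sketch: report components whose version is not identical on every host.
# All inventory data below is placeholder data for illustration.
from collections import defaultdict


def find_version_drift(inventory):
    """Map each inconsistent component to {version: [hosts running it]}."""
    by_component = defaultdict(lambda: defaultdict(list))
    for host, components in inventory.items():
        for component, version in components.items():
            by_component[component][version].append(host)
    # A component is consistent when exactly one version is seen cluster-wide.
    return {
        component: dict(versions)
        for component, versions in by_component.items()
        if len(versions) > 1
    }


# Hypothetical inventory: host -> {component: version}.
EXAMPLE_INVENTORY = {
    "ESXi1": {"hba_driver": "1.0.0", "nic_driver": "2.0.0"},
    "ESXi2": {"hba_driver": "1.1.0", "nic_driver": "2.0.0"},
    "ESXi3": {"hba_driver": "1.1.0", "nic_driver": "2.0.0"},
}

for component, versions in find_version_drift(EXAMPLE_INVENTORY).items():
    for version, hosts in sorted(versions.items()):
        print(f"{component} {version}: {', '.join(hosts)}")
```

A component that is missing from one host's entry is not reported by this sketch, so the inventory should list the same components for every host.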

Additional Information

vSAN SSD congestion >100 is a strong indicator of severe queuing or overload on the cache device.
Congestion in one disk group can lead to performance degradation across the cluster.

vSAN memory or SSD congestion reached threshold limit

vSAN performance diagnostics reports: "vSAN is experiencing congestion in one or more disk group(s)"
