One or more virtual machines may crash or become unresponsive due to a vSAN NVMe disk failure in vSAN OSA Cluster.

Article ID: 405733

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

  • One or more virtual machines may crash or become unresponsive.

    • The vmware.log of the impacted virtual machine records the crash event:

      2025-03-31T15:19:56.869Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

  • VMs may experience high I/O wait times or degraded performance.

  • vSAN reports heartbeat timeout events for VM namespaces:

    2025-03-31T15:19:07.538Z In(14) vobd[2098025]  [vmfsCorrelator] 18087105335117us: [vob.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
    2025-03-31T15:19:07.538Z In(14) vobd[2098025]  [vmfsCorrelator] 18087670645103us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx

  • NVMe devices are marked as Permanent Device Loss (PDL):

    2025-03-31T15:21:39.844Z Wa(180) vmkwarning: cpu56:2097899)WARNING: NvmeDeviceIO: 1696: Command 0x2 to device "t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8" marked for PDL virtual reset completed with  abort/reset

  • vsandevicemonitord.log keeps reporting the disk state as DISK_UNDER_STUCK_IO for an extended period (more than 16 hours in this example):

    2025-03-31T15:27:03Z In(14) vsandevicemonitord[2100914] Device t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8 state is DISK_UNDER_STUCK_IO
    2025-04-01T07:57:54Z In(14) vsandevicemonitord[2100914] Device t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8 state is DISK_UNDER_STUCK_IO
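The signatures listed above can be checked for in collected logs with a short script. The following is a minimal sketch, not a supported tool: the function names and labels are hypothetical, the patterns are copied from the excerpts in this article, and it assumes that vsandevicemonitord.log timestamps have no fractional seconds, as in the sample lines.

```python
import re
from datetime import datetime

# Patterns taken from the log excerpts in this article; the labels are illustrative.
SIGNATURES = {
    "vm_crash": re.compile(r"The CPU has been disabled by the guest operating system"),
    "heartbeat_timeout": re.compile(r"vmfs\.heartbeat\.timedout"),
    "pdl": re.compile(r"marked for PDL"),
    "stuck_io": re.compile(r"state is DISK_UNDER_STUCK_IO"),
}

# Captures the timestamp and the device name of a stuck-I/O report.
STUCK = re.compile(r"^(\S+)\s.*Device (\S+) state is DISK_UNDER_STUCK_IO")


def classify(line):
    """Return the labels of all signatures found in one log line."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(line)]


def stuck_io_duration(lines):
    """Map each device to the time between its first and last stuck-I/O report."""
    seen = {}
    for line in lines:
        match = STUCK.match(line.strip())
        if not match:
            continue
        # Assumption: timestamps look like 2025-03-31T15:27:03Z (no fraction).
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
        device = match.group(2)
        first, _ = seen.get(device, (stamp, stamp))
        seen[device] = (first, stamp)
    return {device: last - first for device, (first, last) in seen.items()}
```

Feeding the lines of vobd.log, vmkwarning.log, vmware.log, and vsandevicemonitord.log through classify() shows which symptoms are present. stuck_io_duration() indicates how long a device has been reported as stuck, assuming the lines are passed in chronological order.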

 

Environment

VMware vSAN

Cause

All ESXi versions prior to 8.0 P05 are affected by a known race condition between transient error handling and All Paths Down (APD) error handling. The condition clears only after all outstanding I/O operations to the Local Log-Structured Object Manager (LSOM) have completed.

The race condition is typically triggered by transient NVMe disk errors, which are often the result of underlying hardware or firmware anomalies.

Refer to: vSAN Networking Terms and Definitions

Resolution

The identified race condition has been resolved in VMware ESXi 8.0 Update 3e (Build 24674464), also known as ESXi 8.0 P05.
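Whether a host already runs a fixed build can be checked by comparing its build number with the one stated above. The sketch below is illustrative only: the function name is hypothetical, and it assumes the version string has the form printed by `vmware -v` on the host, for example "VMware ESXi 8.0.3 build-24674464". Build numbers are compared only within the 8.0 line, because build numbers of different release lines are not comparable with each other.

```python
import re

# Build number of ESXi 8.0 Update 3e (8.0 P05), as stated in this article.
FIXED_BUILD = 24674464


def has_race_condition_fix(version_string):
    """Return True/False for ESXi 8.0 and earlier, None for later release lines."""
    match = re.search(r"ESXi (\d+)\.(\d+)\.\d+ build-(\d+)", version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    major, minor, build = (int(part) for part in match.groups())
    if (major, minor) < (8, 0):
        # All versions prior to 8.0 P05 are affected, whatever the build number.
        return False
    if (major, minor) == (8, 0):
        return build >= FIXED_BUILD
    # Release lines after 8.0 are not covered by this article.
    return None
```

For example, has_race_condition_fix("VMware ESXi 7.0.3 build-24723872") returns False even though the build number is higher than 24674464, because the host is on the 7.0 line.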

For transient vSAN NVMe disk errors, engage your hardware vendor to investigate potential hardware or firmware causes, as such errors often originate from underlying hardware issues.

Note:

  • With the release of VMware ESXi 8.0 P05, enhancements have been introduced to improve the handling of stuck I/O conditions on NVMe devices. These enhancements include support for configuring the ScsiTMHardTimeout parameter for NVMe devices, which can help reduce the time required to detect and report disk failure events.

  • The minimum supported value for ScsiTMHardTimeout is 30 seconds when using ESXi 8.0 P05 or later.

  • The default value of ScsiTMHardTimeout is 120 seconds.

  • To modify the ScsiTMHardTimeout setting, contact Broadcom Support for assistance.

Additional Information