2025-03-31T15:19:56.869Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.
vSAN reports heartbeat timeout events for VM namespaces:
2025-03-31T15:19:07.538Z In(14) vobd[2098025] [vmfsCorrelator] 18087105335117us: [vob.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
2025-03-31T15:19:07.538Z In(14) vobd[2098025] [vmfsCorrelator] 18087670645103us: [esx.problem.vmfs.heartbeat.timedout] xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx xxxxxxxx-xxxxxxxx-xxxx-xxxxxxxxxxxx
NVMe devices are marked as Permanent Device Loss (PDL):
2025-03-31T15:21:39.844Z Wa(180) vmkwarning: cpu56:2097899)WARNING: NvmeDeviceIO: 1696: Command 0x2 to device "t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8" marked for PDL virtual reset completed with abort/reset
vsandevicemonitord.log continues to report the disk state as DISK_UNDER_STUCK_IO for an extended period:
2025-03-31T15:27:03Z In(14) vsandevicemonitord[2100914] Device t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8 state is DISK_UNDER_STUCK_IO
2025-04-01T07:57:54Z In(14) vsandevicemonitord[2100914] Device t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8 state is DISK_UNDER_STUCK_IO
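To gauge how long a device has been reported in this state, the log entries can be scanned for the first and last DISK_UNDER_STUCK_IO report per device. The following is a minimal sketch, not a supported tool: it assumes the entry format shown in the excerpt above, and the names `STUCK_IO`, `stuck_io_windows`, and `SAMPLE` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches vsandevicemonitord entries in the format shown in the excerpt above.
# The state is matched literally so entries that run together are still found.
STUCK_IO = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z In\(\d+\) "
    r"vsandevicemonitord\[\d+\] Device (\S+) state is DISK_UNDER_STUCK_IO"
)


def stuck_io_windows(log_text):
    """Return {device: (first_seen, last_seen)} for DISK_UNDER_STUCK_IO reports."""
    windows = {}
    for stamp, device in STUCK_IO.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").replace(
            tzinfo=timezone.utc
        )
        first, last = windows.get(device, (when, when))
        windows[device] = (min(first, when), max(last, when))
    return windows


# Sample input copied from the excerpt above.
DEVICE = "t10.NVMe____Xytr_Xyz_NVMe_SED_Z9980_RT_U.2_3.84TB___A1B2C3D4E5F6G7H8"
SAMPLE = (
    f"2025-03-31T15:27:03Z In(14) vsandevicemonitord[2100914] "
    f"Device {DEVICE} state is DISK_UNDER_STUCK_IO\n"
    f"2025-04-01T07:57:54Z In(14) vsandevicemonitord[2100914] "
    f"Device {DEVICE} state is DISK_UNDER_STUCK_IO\n"
)

if __name__ == "__main__":
    for device, (first, last) in stuck_io_windows(SAMPLE).items():
        print(device, "reported stuck for at least", last - first)
```

For the two entries above, the window between the first and last report is 16 hours, 30 minutes, and 51 seconds.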
VMware vSAN
A known issue affecting all ESXi versions prior to 8.0 P05 involves a race condition between transient error handling and APD (All Paths Down) error handling. This condition is resolved only after all outstanding I/O operations to the Log-Structured Object Manager (LSOM) are completed.
The race condition is typically triggered by transient NVMe disk errors, which are often the result of underlying hardware or firmware anomalies.
The identified race condition has been resolved in VMware ESXi 8.0 Update 3e (Build 24674464), also known as ESXi 8.0 P05.
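As an illustration, a host's version string can be compared against the fixed build. This is a hedged sketch only: it assumes the version line has the form printed by `vmware -v` on the host (for example `VMware ESXi 8.0.3 build-24674464`), and `has_race_fix` is a hypothetical helper name. Build numbers are compared only within the 8.0 release line, because builds from different release lines are not comparable.

```python
import re

# Fixed build from the article: ESXi 8.0 Update 3e (ESXi 8.0 P05).
FIXED_BUILD = 24674464


def has_race_fix(version_line):
    """Return True/False for releases up to 8.0, or None for later release lines.

    Assumes a version line such as 'VMware ESXi 8.0.3 build-24674464'.
    """
    match = re.search(r"ESXi (\d+)\.(\d+)\.\d+ build-(\d+)", version_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {version_line!r}")
    major, minor, build = (int(part) for part in match.groups())
    if (major, minor) < (8, 0):
        # The article states that all versions prior to 8.0 P05 are affected.
        return False
    if (major, minor) == (8, 0):
        return build >= FIXED_BUILD
    # Release lines after 8.0 are not covered by the article.
    return None


if __name__ == "__main__":
    print(has_race_fix("VMware ESXi 8.0.3 build-24674464"))
```

A `None` result means the article does not say whether that release line carries the fix, so it should be checked against the release notes rather than assumed.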
For transient vSAN NVMe disk errors, engage your hardware vendor to investigate potential hardware or firmware causes, as such errors often originate from underlying hardware issues.
Note:
The ScsiTMHardTimeout parameter for NVMe devices can help reduce the time required to detect and report disk failure events.