Windows VMs with VBS enabled freeze and show 100% CPU usage

Products

VMware vSphere ESXi

Issue/Introduction

Windows VMs with Virtualization-Based Security (VBS) enabled randomly enter a hung state and become completely unresponsive.
The VM shows 100% CPU utilization in vSphere.
Performing a vMotion or taking a virtual machine snapshot instantly unfreezes the guest OS and restores normal operation.
The ESXi logs show storage disconnections or dropped frames, for example:

/var/run/log/vmkernel.log
YYYY-MM-DDTHH:MM:SS.542Z In(182) vmkernel: cpu33:2098430)NMP: nmp_ThrottleLogForDevice:3893: Cmd 0x2a (0x45cb22ea6900, 3123469) to dev "naa.##############################" on path "vmhba0:C0:T12:L8" Failed:
YYYY-MM-DDTHH:MM:SS.542Z In(182) vmkernel: cpu33:2098430)NMP: nmp_ThrottleLogForDevice:3898: H:0x7 D:0x2 P:0x0 . Act:EVAL. cmdId.initiator=0x43108d94a0c0 CmdSN 0x333
YYYY-MM-DDTHH:MM:SS.542Z Wa(180) vmkwarning: cpu33:2098430)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:235: NMP device "naa.##############################" state in doubt; requested fast path state update...
YYYY-MM-DDTHH:MM:SS.542Z In(182) vmkernel: cpu33:2098430)ScsiDeviceIO: 4686: Cmd(0x45cb22ea6900) 0x2a, CmdSN 0x333 from world 3123469 to dev "naa.##############################" failed H:0x7 D:0x2 P:0x0
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:2076:C0:T11:L3 - FCP command status: 0x15-0x202 (0x7) portid=6b5200 oxid=0x8eb cdb=8a0000 len=4096 rspInfo=0x0 resid=0x0 fwResid=0x1040 host status = 0x7 device sta$
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu98:2098436)ScsiDeviceIO: 4686: Cmd(0x45eb36fbfcc0) 0x8a, CmdSN 0x28 from world 2114474 to dev "naa.##############################" failed H:0x7 D:0x2 P:0x0
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:1927:(11:3) Dropped frame(s) detected (12480 of 12288 bytes).
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:2076:C0:T11:L3 - FCP command status: 0x15-0x202 (0x7) portid=6b5200 oxid=0x8ec cdb=8a0000 len=12288 rspInfo=0x0 resid=0x0 fwResid=0x30c0 host status = 0x7 device st$
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:1927:(11:3) Dropped frame(s) detected (532480 of 524288 bytes).
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu98:2098436)ScsiDeviceIO: 4686: Cmd(0x45eb36e8c4c0) 0x8a, CmdSN 0xa7 from world 2114474 to dev "naa.##############################" failed H:0x7 D:0x2 P:0x0
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:2076:C0:T11:L3 - FCP command status: 0x15-0x202 (0x7) portid=6b5200 oxid=0x8ef cdb=8a0000 len=524288 rspInfo=0x0 resid=0x0 fwResid=0x82000 host status = 0x7 device $
YYYY-MM-DDTHH:MM:SS.487Z In(182) vmkernel: cpu99:2109717)qlnativefc: vmhba0(c9:0.0): qlnativefcStatusEntry:1927:(11:3) Dropped frame(s) detected (120640 of 118784 bytes).

/var/run/log/vobd.log
YYYY-MM-DDTHH:MM:SS.957Z In(14) vobd[2098148]: [vmfsCorrelator] 778489926280us: [vob.vmfs.heartbeat.recovered] Reclaimed heartbeat for volume ######-######-######-###### (<Datastore>): [Timeout] [HB state abcdef02 offset 3411968 gen 56719 stampUS 778489925319 uuid ######-######-######-###### jrnl <FB 25165825> drv 24.82]
YYYY-MM-DDTHH:MM:SS.957Z In(14) vobd[2098148]: [vmfsCorrelator] 778508055697us: [esx.problem.vmfs.heartbeat.recovered] ######-######-######-###### <Datastore>
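
The ScsiDeviceIO failure lines in the vmkernel.log excerpt above can be summarized with a short script. The following is a minimal Python sketch, not a VMware tool. The regular expression is derived only from the sample lines in this article, so other ESXi builds may need a different pattern, and the function name summarize_scsi_failures is hypothetical. The opcode names come from the SCSI command set (0x2a is WRITE(10), 0x8a is WRITE(16)).

```python
import re
from collections import Counter

# Pattern derived from the sample ScsiDeviceIO lines in this article.
# This is an assumption: other ESXi builds may format these lines differently.
SCSI_FAIL_RE = re.compile(
    r'ScsiDeviceIO: \d+: Cmd\(0x[0-9a-f]+\) (0x[0-9a-f]+), .*?'
    r'to dev "([^"]+)" failed H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+)'
)

# Standard SCSI opcodes that appear in the sample log; extend as needed.
OPCODES = {"0x2a": "WRITE(10)", "0x8a": "WRITE(16)"}


def summarize_scsi_failures(lines):
    """Count failed commands per (device, opcode, host, device, plugin status)."""
    counts = Counter()
    for line in lines:
        match = SCSI_FAIL_RE.search(line)
        if match:
            opcode, dev, host, device, plugin = match.groups()
            key = (dev, OPCODES.get(opcode, opcode), host, device, plugin)
            counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so it can be fed an open file handle for a copy of vmkernel.log. A device that repeatedly reports the same H/D/P combination for write commands is a reasonable starting point for the storage investigation.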

Environment

VMware ESXi 8.x
VMware ESX 9.0

Cause

An underlying storage issue left VM I/O stuck for an extended period, which caused the VM to enter a hung state.
- When a physical storage command is stuck, the active Direct Memory Access (DMA) cannot complete. Simultaneously, a thread inside the guest OS attempts to flush the IOMMU translation cache, which requires all DMA to drain first. This thread blocks waiting for the stuck storage command, causing a cascade that freezes all other CPU threads. The OS becomes completely deadlocked, preventing the Windows storage stack from executing its normal SCSI timeout and reset routines.
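
The dependency chain described above can be pictured as a wait-for graph that contains a cycle. The following Python sketch is a toy model for illustration only: the node names are made-up labels for the components named in this article, not real kernel objects, and the model does not reflect actual Windows or ESXi internals.

```python
# Toy wait-for graph. Each key waits on the items in its list.
# Node names are illustrative labels, not real kernel objects.
WAITS_ON = {
    "scsi_timeout_handler": ["guest_cpu_threads"],      # handler needs a runnable CPU thread
    "guest_cpu_threads": ["iommu_flush_thread"],        # other threads stall behind the flush
    "iommu_flush_thread": ["dma_drain"],                # flush requires all DMA to drain
    "dma_drain": ["stuck_storage_command"],             # DMA cannot finish while the command is stuck
    "stuck_storage_command": ["scsi_timeout_handler"],  # only a timeout or abort would release it
}


def find_cycle(graph, start):
    """Depth-first search that returns the first wait-for cycle reachable from start, or None."""
    path = []       # nodes on the current search path, in order
    on_path = set()
    done = set()    # nodes fully explored without finding a cycle

    def visit(node):
        if node in on_path:
            return path[path.index(node):] + [node]
        if node in done:
            return None
        on_path.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(node)
        done.add(node)
        return None

    return visit(start)


if __name__ == "__main__":
    print(find_cycle(WAITS_ON, "iommu_flush_thread"))
    # Model something outside the guest completing or aborting the stuck command
    # by removing its outgoing edge; the cycle disappears.
    released = dict(WAITS_ON, stuck_storage_command=[])
    print(find_cycle(released, "iommu_flush_thread"))
```

In this model the guest cannot break the cycle by itself, because the only component that would release the stuck command is waiting, indirectly, on that same command.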

Resolution

To resolve this issue, stabilize the underlying storage fabric. Contact your storage vendor to investigate the storage disconnections and packet drops.

Stabilize Storage Fabric
- Identify the specific HBA and physical interconnect reporting frame drops.
- Inspect all physical components along the affected path, including cables, SFP transceivers, and switch ports.
- Replace any degraded hardware to ensure the storage layer remains stable.
- Ensure that the firmware and drivers are at the latest versions supported by the compatibility matrix.
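
To help with the first step in the list above, the "Dropped frame(s) detected" messages can be tallied per HBA. This is a minimal Python sketch under stated assumptions: the pattern is taken from the qlnativefc sample lines in this article only, the function name tally_dropped_frames is hypothetical, and reading the "(11:3)" token as target:LUN is an assumption based on the adjacent "C0:T11:L3" lines.

```python
import re
from collections import Counter

# Pattern derived from the qlnativefc sample lines in this article (an assumption;
# other driver versions may log a different format).
DROP_RE = re.compile(
    r'qlnativefc: (vmhba\d+)\([^)]*\): qlnativefcStatusEntry:\d+:\((\d+):(\d+)\) '
    r'Dropped frame\(s\) detected'
)


def tally_dropped_frames(lines):
    """Count dropped-frame events per (hba, target, lun).

    Treating the "(11:3)" token as target:LUN is an assumption, not documented behavior.
    """
    counts = Counter()
    for line in lines:
        match = DROP_RE.search(line)
        if match:
            hba, target, lun = match.groups()
            counts[(hba, int(target), int(lun))] += 1
    return counts
```

Calling `tally_dropped_frames(...).most_common()` on the lines of a vmkernel.log copy lists the HBA and target combinations with the most events first, which narrows down the cables, SFP transceivers, and switch ports to inspect.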

Workaround 1: Disable VBS
- If the storage instability cannot be corrected immediately, you can prevent the deadlock by disabling Virtualization-Based Security (VBS) on the affected VMs. For the procedure, see "Activate or Deactivate VBS on an Existing VM in the VMware Host Client" in the VMware documentation. With VBS disabled, the Windows storage stack can handle I/O timeouts and resets normally without deadlocking the kernel.
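
Before applying this workaround, it can help to list which VMs currently have VBS enabled. The following Python sketch assumes VM objects shaped like those returned by pyVmomi, where the flag is exposed as config.flags.vbsEnabled. How the VM list is retrieved from vCenter is out of scope here, the function name vms_with_vbs_enabled is hypothetical, and the demo uses stand-in objects rather than a real inventory.

```python
from types import SimpleNamespace


def vms_with_vbs_enabled(vms):
    """Return the names of VMs whose config.flags.vbsEnabled is true.

    Assumes pyVmomi-style objects; missing config or flags are treated as VBS off.
    """
    names = []
    for vm in vms:
        config = getattr(vm, "config", None)    # config may be None for inaccessible VMs
        flags = getattr(config, "flags", None)
        if getattr(flags, "vbsEnabled", False):
            names.append(vm.name)
    return names


if __name__ == "__main__":
    # Stand-in objects for illustration; the VM names are made up.
    demo = [
        SimpleNamespace(name="win-a", config=SimpleNamespace(flags=SimpleNamespace(vbsEnabled=True))),
        SimpleNamespace(name="win-b", config=SimpleNamespace(flags=SimpleNamespace(vbsEnabled=False))),
        SimpleNamespace(name="orphaned", config=None),
    ]
    print(vms_with_vbs_enabled(demo))
```

The resulting list can be compared against the VMs that have frozen to decide where to apply the workaround first.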

Workaround 2: Future ESXi Releases
- Future ESXi releases introduce a mechanism in which the ESXi storage stack attempts to abort any stuck I/Os that the guest operating system fails to abort.
- This is a "best effort" recovery measure. Stabilizing the storage layer remains the primary recommendation.