To avoid crashing the ESXi host with a PSOD (purple screen of death).
Symptoms:
Panic Details: Crash at YYYY-MM-DDTHH:MM:SS.825Z on CPU 0 running world 2102090. VMK Uptime:72:07:15:03.736
Panic Message: @BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0xc42ce(0x418002400000):0x43190b62a500:0xfc8] (Src 0x4, CPU0)
0x450a00002d10:[0x41800250ac15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x418002889fe0, 0x418002889f28, 0x450a00002db8, 0x43026acd2028, 0x450a00000001
0x450a00002db0:[0x41800250aea1]Panic_WithBacktrace@vmkernel#nover+0x56 stack: 0x450a00002e20, 0x450a00002dd0, 0x0, 0x0, 0xc42ce
0x450a00002e20:[0x418002507c91]NMI_Interrupt@vmkernel#nover+0x3c2 stack: 0x0, 0xfc8, 0x5320302075706370, 0x206b636f4c6e6970, 0x74756f206e697073
0x450a00002ea0:[0x418002543ffc]IDTNMIWork@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
0x450a00002f20:[0x4180025454f0]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x418002560067, 0xfd0, 0xfd0, 0x0
0x450a00002f40:[0x418002560066]gate_entry@vmkernel#nover+0x67 stack: 0x0, 0x0, 0xf, 0x1c0477, 0x0
0x451ac901be88:[0x4180024c42ce]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f, 0x16b7487, 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30, 0x432062092050, 0x41800386f366
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b, 0x418002514853, 0x8818c0, 0x4180025502c7, 0x451ac90232c0
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c, 0x417fd92021c0, 0x0, 0x451ac9023000, 0x451ac6123100
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
Saved backtrace from: pcpu 0 SpinLock spin out NMI
0x451ac901be88:[0x4180024c42cd]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.0
VMware vSphere Replication 6.x
VMware vSphere Replication 8.x
hbr_filter searches for a whole contiguous region in the transfer bitmap. This usually works well when the regions are small. When the regions are large enough (for example, when performing a full sync of a large disk with checksumming disabled), iterating over them may result in a PSOD, because the disk lock is held for a long time, which exceeds the spinlock spin count of the other contending PCPUs.
This issue is resolved in VMware vSphere ESXi 6.0 Patch ESXi600-201909001, ESXi 6.5 Update 3, and ESXi 6.7 Update 3.
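To check whether a host is already running a build that contains the fix, query the installed version and update level on the host, for example (the output shown is illustrative):
$ vmware -vl
VMware ESXi 6.7.0 build-14320388
VMware ESXi 6.7.0 Update 3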
Workaround:
To work around the issue, follow the steps below.
Identify the VM that is part of the replication and note its RDID, for example:
This is the replication ID of the disk: RDID-13d1285d-e660-4da9-8ffd-9e921a84ea2c
The corresponding replication group ID: GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9
You can find the VM by checking the replication configuration of the VMs on the host:
$ vim-cmd hbrsvc/vmreplica.getConfig <vmID>
where <vmID> can be obtained from the list of the registered VMs:
$ vim-cmd vmsvc/getallvms
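Sample getallvms output (abridged; names and values are illustrative):
Vmid    Name     File                            Guest OS          Version
12      app01    [datastore1] app01/app01.vmx    centos7_64Guest   vmx-13
In this example, the <vmID> of the VM named app01 is 12.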
The replication ID in the getConfig output should match GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9.
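For example, in the getConfig output the replication group ID appears as vmReplicationId and the disk's replication ID as diskReplicationId. The output below is abridged and illustrative; the exact type names and layout may vary by build:
$ vim-cmd hbrsvc/vmreplica.getConfig 12
(vim.vm.ReplicationConfigSpec) {
   vmReplicationId = "GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9",
   ...
   disk = (vim.vm.ReplicationInfoDiskSettings) [
      (vim.vm.ReplicationInfoDiskSettings) {
         key = 2000,
         diskReplicationId = "RDID-13d1285d-e660-4da9-8ffd-9e921a84ea2c"
      }
   ]
}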
Then stop the replication for this VM.
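Stopping (or pausing) the replication is normally done from the vSphere Replication UI. On the ESXi host itself, the vim-cmd hbrsvc namespace offers equivalent operations; sub-command availability may vary by build, so treat the following as illustrative:
$ vim-cmd hbrsvc/vmreplica.pause <vmID>
or, to disable the replication entirely:
$ vim-cmd hbrsvc/vmreplica.disable <vmID>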