To avoid crashing the ESXi host with a PSOD (purple screen of death).
Symptoms:
Panic Details: Crash at YYYY-MM-DDTHH:MM:SS.825Z on CPU 0 running world 2102090. VMK Uptime:72:07:15:03.736
Panic Message: @BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0xc42ce(0x418002400000):0x43190b62a500:0xfc8] (Src 0x4, CPU0)
0x450a00002d10:[0x41800250ac15]PanicvPanicInt@vmkernel#nover+0x439 stack: 0x418002889fe0, 0x418002889f28, 0x450a00002db8, 0x43026acd2028, 0x450a00000001
0x450a00002db0:[0x41800250aea1]Panic_WithBacktrace@vmkernel#nover+0x56 stack: 0x450a00002e20, 0x450a00002dd0, 0x0, 0x0, 0xc42ce
0x450a00002e20:[0x418002507c91]NMI_Interrupt@vmkernel#nover+0x3c2 stack: 0x0, 0xfc8, 0x5320302075706370, 0x206b636f4c6e6970, 0x74756f206e697073
0x450a00002ea0:[0x418002543ffc]IDTNMIWork@vmkernel#nover+0x99 stack: 0x0, 0x0, 0x0, 0x0, 0x0
0x450a00002f20:[0x4180025454f0]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x418002560067, 0xfd0, 0xfd0, 0x0
0x450a00002f40:[0x418002560066]gate_entry@vmkernel#nover+0x67 stack: 0x0, 0x0, 0xf, 0x1c0477, 0x0
0x451ac901be88:[0x4180024c42ce]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f, 0x16b7487, 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0, 0x43206200a730, 0x43209a0c8c30, 0x432062092050, 0x41800386f366
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b, 0x418002514853, 0x8818c0, 0x4180025502c7, 0x451ac90232c0
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c, 0x417fd92021c0, 0x0, 0x451ac9023000, 0x451ac6123100
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0, 0x0, 0x0, 0x0, 0x0
Saved backtrace from: pcpu 0 SpinLock spin out NMI
0x451ac901be88:[0x4180024c42cd]BitVector_NextBit@vmkernel#nover+0x46 stack: 0x41800385a75f
0x451ac901be98:[0x4180024c451b]BitVector_NextExtent@vmkernel#nover+0x4c stack: 0x432099c4dfd0
0x451ac901bed0:[0x41800386f365]TransferDispatchExtent@(hbr_filter)#<None>+0xb2 stack: 0x418003870c3b
0x451ac901bf80:[0x418003870a99]ResourceWorld@(hbr_filter)#<None>+0xa2 stack: 0x43209a0c8c6c
0x451ac901bfe0:[0x418002709112]CpuSched_StartWorld@vmkernel#nover+0x77 stack: 0x0
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.0
VMware vSphere Replication 6.x
VMware vSphere Replication 8.x
hbr_filter searches for a whole contiguous region in the transfer bitmap. This usually works well when the regions are small. When the regions are large enough (for example, when performing a full sync of a large disk with checksumming disabled), iterating over them may result in a PSOD, because the disk lock is held for a long time, which exceeds the spinlock spin count of the other contending PCPUs.
This issue is resolved in VMware vSphere ESXi 6.0 Patch ESXi600-201909001, ESXi 6.5 Update 3, and ESXi 6.7 Update 3.
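To check whether a host is already running a build that contains the fix, query the installed version and update level on the host, for example (the output shown is illustrative):
$ vmware -vl
VMware ESXi 6.7.0 build-14320388
VMware ESXi 6.7.0 Update 3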
Workaround:
To work around the issue, follow the steps below.
Identify the VM that is part of the replication and note its RDID, for example:
This is the replication ID of the disk: RDID-13d1285d-e660-4da9-8ffd-9e921a84ea2c
The corresponding replication group ID: GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9
You can find the VM by checking the replication configuration of the VMs on the host:
$ vim-cmd hbrsvc/vmreplica.getConfig <vmID>
where <vmID> can be obtained from the list of the registered VMs:
$ vim-cmd vmsvc/getallvms
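Sample getallvms output (abridged; names and values are illustrative):
Vmid    Name     File                            Guest OS          Version
12      app01    [datastore1] app01/app01.vmx    centos7_64Guest   vmx-13
In this example, the <vmID> of the VM named app01 is 12.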
The replication ID in the getConfig output should match GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9.
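For example, in the getConfig output the replication group ID appears as vmReplicationId and the disk's replication ID as diskReplicationId. The output below is abridged and illustrative; the exact type names and layout may vary by build:
$ vim-cmd hbrsvc/vmreplica.getConfig 12
(vim.vm.ReplicationConfigSpec) {
   vmReplicationId = "GID-4f1df3b0-16fc-4e66-bddd-01ccc688a8d9",
   ...
   disk = (vim.vm.ReplicationInfoDiskSettings) [
      (vim.vm.ReplicationInfoDiskSettings) {
         key = 2000,
         diskReplicationId = "RDID-13d1285d-e660-4da9-8ffd-9e921a84ea2c"
      }
   ]
}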
Then stop the replication for this VM.
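Stopping (or pausing) the replication is normally done from the vSphere Replication UI. On the ESXi host itself, the vim-cmd hbrsvc namespace offers equivalent operations; sub-command availability may vary by build, so treat the following as illustrative:
$ vim-cmd hbrsvc/vmreplica.pause <vmID>
or, to disable the replication entirely:
$ vim-cmd hbrsvc/vmreplica.disable <vmID>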