ESXi 7.0 Update 3 host fails with a backtrace NMI IPI: Panic requested by another PCPU.

Products

VMware vSphere ESXi

Issue/Introduction

This article explains how to avoid an ESXi 7.0 Update 3 host failing with the purple diagnostic screen described in the Symptoms section.

Symptoms:

The host was recently upgraded to ESXi 7.0.3, build 18644231.
Thin-provisioned virtual disks (VMDKs) residing on VMFS6 datastores may cause multiple hosts in an HA cluster to fail with a purple diagnostic screen.
In the /var/run/log/vmkernel.* file, you see entries similar to:

2021-10-20T03:11:41.679Z cpu0:2352732)@BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0x1404f8(0x420004800000):0x12b8:0xf48] (Src 0x1, CPU0)
2021-10-20T03:11:41.689Z cpu0:2352732)Code start: 0x420004800000 VMK uptime: 11:07:27:23.196
2021-10-20T03:11:41.697Z cpu0:2352732)Saved backtrace from: pcpu 0 Heartbeat NMI
2021-10-20T03:11:41.715Z cpu0:2352732)0x45394629b8b8:[0x4200049404f7]HeapVSIAddChunkInfo@vmkernel#nover+0x1b0 stack: 0x420005bd611e
2021-10-20T03:11:41.734Z cpu0:2352732)0x45394629b8c0:[0x420004943036]Heap_AlignWithTimeoutAndRA@vmkernel#nover+0x1eb stack: 0x431822b49000
2021-10-20T03:11:41.750Z cpu0:2352732)0x45394629b940:[0x420005bd611d]J6_NewOnDiskTxn@esx#nover+0x15a stack: 0x43181f200560
2021-10-20T03:11:41.764Z cpu0:2352732)0x45394629b9a0:[0x420005bd667d]J6CommitInMemTxn@esx#nover+0x176 stack: 0x1
2021-10-20T03:11:41.781Z cpu0:2352732)0x45394629ba50:[0x420005bd318a]J6_CommitMemTransaction@esx#nover+0xe3 stack: 0x1e9400000037
2021-10-20T03:11:41.795Z cpu0:2352732)0x45394629baa0:[0x420005bf8ad4]Fil6_UnmapTxn@esx#nover+0x4fd stack: 0x0
2021-10-20T03:11:41.809Z cpu0:2352732)0x45394629bbb0:[0x420005bfc891]Fil6UpdateBlocks@esx#nover+0x4e2 stack: 0xff
2021-10-20T03:11:41.824Z cpu0:2352732)0x45394629bc30:[0x420005bbc3fe]Fil3UpdateBlocks@esx#nover+0xeb stack: 0x21f9b800
2021-10-20T03:11:41.842Z cpu0:2352732)0x45394629bd30:[0x420005bbd425]Fil3_PunchFileHoleWithRetry@esx#nover+0x7e stack: 0x45394629bec8
2021-10-20T03:11:41.859Z cpu0:2352732)0x45394629bde0:[0x420005bbdc0d]Fil3_FileBlockUnmap@esx#nover+0x57e stack: 0x43181eeddfd0
2021-10-20T03:11:41.877Z cpu0:2352732)0x45394629be90:[0x42000483b5fb]FSSVec_FileBlockUnmap@vmkernel#nover+0x20 stack: 0x45b9414506e0
2021-10-20T03:11:41.894Z cpu0:2352732)0x45394629bea0:[0x420004d52c03]VSCSI_ExecFSSUnmap@vmkernel#nover+0x9c stack: 0x430cbe01c170
2021-10-20T03:11:41.911Z cpu0:2352732)0x45394629bf10:[0x420004d50ead]VSCSIDoEmulHelperIO@vmkernel#nover+0x2a stack: 0x430cbe001818
2021-10-20T03:11:41.928Z cpu0:2352732)0x45394629bf40:[0x4200048d9c19]HelperQueueFunc@vmkernel#nover+0x1d2 stack: 0x4539462a0b48
2021-10-20T03:11:41.944Z cpu0:2352732)0x45394629bfe0:[0x420004bb1775]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2021-10-20T03:11:41.959Z cpu0:2352732)0x45394629c000:[0x4200048c46ff]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2021-10-20T03:11:41.970Z cpu0:2352732)base fs=0x0 gs=0x420040000000 Kgs=0x0
2021-10-20T03:11:41.975Z cpu0:2352732)1 other PCPU is in panic.
2021-10-20T03:11:13.546Z cpu0:2352732)NMI: 689: NMI IPI: RIPOFF(base):RBP:CS [0x144c3b(0x420004800000):0x12c0:0xf48] (Src 0x1, CPU0)
2021-10-20T03:11:00.545Z cpu0:2352732)NMI: 689: NMI IPI: RIPOFF(base):RBP:CS [0x104eff(0x420004800000):0x0:0xf48] (Src 0x1, CPU0)

Note: The preceding log excerpts are only examples. Dates, times, and other values may vary depending on your environment.
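
The signature in the excerpt above can be checked for mechanically. The following Python function is a minimal sketch, not an official detection rule: it assumes that the panic line together with at least one of the UNMAP-path frames from the example backtrace is enough to flag a log for closer review. The function name and the choice of frames are illustrative.

```python
import re

# Panic banner as it appears in the example excerpt.
PANIC_RE = re.compile(r"@BlueScreen: NMI IPI: Panic requested by another PCPU")

# Frames copied from the example backtrace. Treating them as a signature
# is an assumption made for this sketch.
UNMAP_FRAMES = ("Fil6_UnmapTxn", "Fil3_FileBlockUnmap", "VSCSI_ExecFSSUnmap")


def matches_unmap_psod(log_text):
    """Return True if the text has the panic line and an UNMAP-path frame."""
    has_panic = PANIC_RE.search(log_text) is not None
    has_unmap_frame = any(frame in log_text for frame in UNMAP_FRAMES)
    return has_panic and has_unmap_frame
```

A match only suggests that the backtrace resembles the one in this article. It does not confirm the root cause.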

Environment

VMware vSphere ESXi 7.0
VMware vSphere 7.0.x

Cause

In the ESXi 7.0.3 release, VMFS introduced a change to make UNMAP granularities uniform across VMFS and SE Sparse snapshots. As part of this change, the maximum UNMAP granularity reported by VMFS was adjusted to 2 GB. In rare situations, a 2 GB TRIM/UNMAP request issued from the guest OS can result in a VMFS metadata transaction that requires lock acquisition on a large number of resource clusters (greater than 50). This case is not handled correctly and results in an ESXi PSOD. VMFS metadata transactions that require lock actions on more than 50 resource clusters are not common and can occur on aged datastores. This issue only impacts thin-provisioned VMDKs; thick and eager-zeroed thick VMDKs are not impacted.
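
The conditions described above can be combined into a simple exposure check when reviewing an inventory. The following Python function is a rough sketch under stated assumptions: it treats only the build number listed under Symptoms as affected, and the function name and the string values for datastore type and disk format are hypothetical labels, not values returned by any VMware API.

```python
# Build number taken from the Symptoms section. Treating it as the only
# affected build is an assumption of this sketch.
AFFECTED_BUILD = 18644231


def is_exposed(build, datastore_fs, disk_format):
    """Return True when a disk meets all conditions described in this article.

    build        -- ESXi build number as an int
    datastore_fs -- hypothetical label such as "VMFS6" or "NFS"
    disk_format  -- hypothetical label such as "thin", "thick", or "ezt"
    """
    return (
        build == AFFECTED_BUILD
        and datastore_fs.upper() == "VMFS6"
        and disk_format.lower() == "thin"
    )
```

Even when all conditions are met, the failure still depends on the guest OS issuing a large TRIM/UNMAP request and on the state of the datastore, so a True result indicates exposure rather than a certain failure.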

Resolution

This issue is resolved in ESXi 7.0 Update 3c.
The issue was also resolved in ESXi 7.0 U3a and U3b, which are no longer available.

Workaround:

There are a few options to work around this issue. Any one of these workarounds prevents the issue from occurring, so choose the one that best fits your situation.

1. Revert to a previous version of ESXi that is not impacted by this issue.
REF: https://knowledge.broadcom.com/external/article/316592/reverting-to-a-previous-version-of-esxi.html

2. Convert thin VMDKs to thick or eager-zeroed thick (EZT) provisioning.
REF: Determine the Virtual Disk Format and Convert a Virtual Disk from the Thin Provision Format to a Thick Provision Format
REF: Inflate Thin Virtual Disks

3. Disable TRIM/UNMAP in the guest OS.
Note: Consult your OS documentation on how to adjust TRIM/UNMAP features for a complete understanding of OS-specific configuration needs. Functions and capabilities may vary across distributions and versions.
Examples: https://www.suse.com/support/kb/doc/?id=000019447
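
As an illustration for Linux guests, the following Python function is a minimal sketch that lists mount points using the `discard` mount option, which enables online TRIM/UNMAP on common Linux file systems. It assumes input in the /proc/mounts format, and the function name is hypothetical. It does not detect periodic trimming, such as a scheduled fstrim job, and it does not change any setting.

```python
from typing import List


def mounts_with_discard(mounts_text: str) -> List[str]:
    """Return mount points whose option list includes a discard option."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts lines: device, mount point, fs type, options, dump, pass
        if len(fields) < 4:
            continue
        mount_point = fields[1]
        options = fields[3].split(",")
        # Match "discard" and variants such as "discard=async";
        # "nodiscard" is intentionally not matched.
        if any(opt == "discard" or opt.startswith("discard=") for opt in options):
            hits.append(mount_point)
    return hits
```

Inside a guest, the function could be called as `mounts_with_discard(open("/proc/mounts").read())`. How to disable the option on each reported mount point depends on the distribution, so follow the OS documentation for that step.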