Intermittent PSODs occur on ESXi 8.0 GA (build 20513097) and ESXi 8.0 U1 (build 21495797), with a backtrace similar to the following:
YYYY-MM-DDTHH:MM:SS.514Z cpu40:2706894)@BlueScreen: #PF Exception 14 in world 2706894:NetWorld-VM- IP 0x############ addr 0x3c
PTEs:0x16b741023;0x307fec0023;0x307fec1023;0x0;
YYYY-MM-DDTHH:MM:SS.514Z cpu40:2706894)Code start: 0x420023c00000 VMK uptime: 2:02:01:22.715
YYYY-MM-DDTHH:MM:SS.514Z cpu40:2706894)0x4538fc81be78:[0x420023e974c9]PktList_SplitByUplinkPort@vmkernel#nover+0x9 stack: 0x0
YYYY-MM-DDTHH:MM:SS.515Z cpu40:2706894)0x4538fc81be80:[0x420023e975c7]PktList_IOCompleteLocked@vmkernel#nover+0xdc stack: 0x0
YYYY-MM-DDTHH:MM:SS.515Z cpu40:2706894)0x4538fc81bef0:[0x420023ea77de]Portset_ProcessAllDeferred@vmkernel#nover+0x2b stack: 0x6000023
YYYY-MM-DDTHH:MM:SS.515Z cpu40:2706894)0x4538fc81bf10:[0x420023ea5ad0]Port_ReleaseNonexcl@vmkernel#nover+0x13d stack: 0x1
YYYY-MM-DDTHH:MM:SS.515Z cpu40:2706894)0x4538fc81bf50:[0x420023e89b03]NetWorldPerVMCB@vmkernel#nover+0x1a8 stack: 0x430093e9af90
YYYY-MM-DDTHH:MM:SS.516Z cpu40:2706894)0x4538fc81bfe0:[0x42002401d766]CpuSched_StartWorld@vmkernel#nover+0x7b stack: 0x0
YYYY-MM-DDTHH:MM:SS.516Z cpu40:2706894)0x4538fc81c000:[0x420023cd4d1f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
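To compare a collected PSOD backtrace against the one above, a small script can help. The following is a minimal sketch, not a vendor tool: it assumes the log text is already available as a string, and it treats the page-fault banner plus the three innermost frames shown above as the signature. The function name `matches_signature` and the choice of three frames are illustrative assumptions.

```python
import re

# Innermost frames from the backtrace above, in order.
SIGNATURE_FRAMES = [
    "PktList_SplitByUplinkPort",
    "PktList_IOCompleteLocked",
    "Portset_ProcessAllDeferred",
]

# Captures the symbol name in a frame such as
# "...:[0x420023e974c9]PktList_SplitByUplinkPort@vmkernel#nover+0x9 stack: 0x0"
FRAME_RE = re.compile(r"\[0x[0-9a-fA-F]+\](\w+)@")


def matches_signature(log_text: str) -> bool:
    """Return True if the text contains the #PF Exception 14 banner and
    its first backtrace frames equal SIGNATURE_FRAMES, in that order."""
    if "#PF Exception 14" not in log_text:
        return False
    frames = FRAME_RE.findall(log_text)
    return frames[: len(SIGNATURE_FRAMES)] == SIGNATURE_FRAMES
```

A match only indicates that the backtrace resembles the one in this article. It does not confirm the root cause.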
ESXi 8.x
This issue occurs when a packet completion routine accesses an invalid pointer, which results in a PSOD on the host.
The fix is available in ESXi 8.0 Update 1c (build 22088125)
The fix is available in NSX 4.1.1.
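When checking several hosts, the ESXi build numbers quoted in this article can be compared programmatically. The sketch below uses only those three build numbers and reports every other build as unknown, because this article does not state which other builds are affected or fixed. The function name `classify_build` is illustrative.

```python
# ESXi build numbers quoted in this article.
REPORTED_AFFECTED = {
    20513097: "ESXi 8.0 GA",
    21495797: "ESXi 8.0 U1",
}
FIXED_BUILD = 22088125  # ESXi 8.0 Update 1c


def classify_build(build: int) -> str:
    """Classify a build number using only the builds named in this article."""
    if build in REPORTED_AFFECTED:
        return f"affected ({REPORTED_AFFECTED[build]})"
    if build == FIXED_BUILD:
        return "fixed (ESXi 8.0 Update 1c)"
    # Any other build is not covered by this article.
    return "unknown, check the release notes"
```

The build number of a host can be read with `esxcli system version get`.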
Workaround:
If the fix cannot be applied immediately, a workaround is available. Disabling Large Acceptance Test (LAT) prevents packets from going read-only and therefore keeps the host from reaching a PSOD state. However, packets can go read-only in other ways as well, so disabling Geneve offload is the more comprehensive workaround. The two options compare as follows:
a. Disable Geneve offload:
Pros: I. Prevents the same issue from being triggered by other offload constraints, such as IPv6 outer and inner L7 offset. II. Allows features like LAT to remain enabled.
Cons: I. Performance impact: Geneve inner checksum and TCP segmentation are done in software, which adds roughly 10% or more CPU overhead on the transmit side. II. Operational inconvenience: an esxcli command must be run on each host.
b. Disable LAT feature:
Pros: I. No performance impact on data traffic. II. Easy to operate, as it only needs to be disabled in one place.
Cons: I. The LAT feature will be disabled. II. Does not prevent the same issue from being triggered by other constraints, such as when other features generate a Geneve option of more than 32 bytes.
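For option (a), running the same command on every host can be scripted. The sketch below is one possible approach under stated assumptions: SSH access to each host is enabled, the host names are hypothetical, and the command string is a placeholder because this article does not give the exact esxcli command. It defaults to a dry run that only prints what would be executed.

```python
import subprocess

# Placeholder: the exact esxcli command is not given in this article.
# Substitute the command from the vendor's instructions.
DISABLE_COMMAND = "<esxcli command that disables Geneve offload>"


def build_ssh_argv(host: str, command: str, user: str = "root") -> list:
    """Return the argv for running one command on one host over SSH."""
    return ["ssh", f"{user}@{host}", command]


def run_on_hosts(hosts: list, command: str, dry_run: bool = True) -> dict:
    """Run the command on each host in turn and return exit codes by host.
    With dry_run=True nothing is executed and every value is None."""
    results = {}
    for host in hosts:
        argv = build_ssh_argv(host, command)
        if dry_run:
            print("would run:", " ".join(argv))
            results[host] = None
        else:
            results[host] = subprocess.run(argv).returncode
    return results


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    run_on_hosts(["esx01.example.com", "esx02.example.com"], DISABLE_COMMAND)
```

Keeping the dry run as the default makes it possible to review the host list and command before anything is changed. Pass `dry_run=False` only after the placeholder has been replaced.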