Impact/Risks:
Hosts may suddenly experience a purple diagnostic screen (PSOD).
If no PSOD occurs, network connections may drop, impacting production traffic. If storage traffic is impacted, virtual machines may crash.
Symptoms:
*Details such as timestamps and identifiers will differ in each environment.*
Running an ESXi build below 7.0 U3g (20328353)
See log messages similar to the following:
vobd.log:
2024-10-10T16:19:08.243Z: [cpuCorrelator] 61858276897876us: [vob.cpu.nmi.ipi.savebt] NMI IPI: RIPOFF(base):RBP:CS [0xad18f(0x420031400000):0:0xf48] (Src 0x1, CPU48)
vmkernel.log:
2024-10-10T16:22:36.297Z cpu56:2097263)WARNING: Uplink: 21014: Queue 9 of device vmnic9 stuck, resetting the device
2024-10-10T16:22:49.230Z cpu7:10491112)WARNING: Heartbeat: 827: PCPU 51 didn't have a heartbeat for 8 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2024-10-10T16:22:49.230Z cpu51:2098227)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x10484d(0x420031400000):0x42004cc00680:0xf48] (Src 0x1, CPU51)
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bd10:[0x42003150484c]MCSLockSpin@vmkernel#nover+0x41 stack: 0x43022a304320
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bd40:[0x420031504eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bd60:[0x42003150538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0x3c4ef6a900000000
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bd80:[0x4200315486d8]FastSlabCreateObj@vmkernel#nover+0x11d stack: 0x100000002
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11be10:[0x420031548d0a]FastSlabReplenishCPU@vmkernel#nover+0x83 stack: 0x1
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11be60:[0x42003154a0fb]FastSlabAllocSlow@vmkernel#nover+0x84 stack: 0xc
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11be80:[0x4200315efd82]Pkt_SlabAllocPkt@vmkernel#nover+0x133 stack: 0x1f
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bec0:[0x4200315ef291]Pkt_AllocHandleWithSize@vmkernel#nover+0xee stack: 0x452182bc2240
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bee0:[0x4200315ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x112
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bf00:[0x4200316a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0x2f1b00
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bf10:[0x420032445055]nmlx5_en_PostRxWqes@(nmlx5_core)#<None>+0xc2 stack: 0x431559aa9d10
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bf40:[0x420032444d4d]nmlx5_en_NetPollCB@(nmlx5_core)#<None>+0x5a stack: 0x1
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bf70:[0x4200316a622f]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x36
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11bfe0:[0x4200317b290d]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2024-10-10T16:22:49.230Z cpu51:2098227)0x45395e11c000:[0x4200314c4b8f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2024-10-10T16:22:40.230Z cpu32:220998598)WARNING: Heartbeat: 827: PCPU 50 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2024-10-10T16:22:40.230Z cpu42:2098215)WARNING: Heartbeat: 827: PCPU 58 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2024-10-10T16:22:40.230Z cpu35:197905059)WARNING: Heartbeat: 827: PCPU 53 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2024-10-10T16:22:40.230Z cpu58:2098204)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x104850(0x420031400000):0x42004e800680:0xf48] (Src 0x1, CPU58)
2024-10-10T16:22:40.230Z cpu53:2098205)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x10484d(0x420031400000):0x42004d400680:0xf48] (Src 0x1, CPU53)
2024-10-10T16:22:40.230Z cpu50:2098347)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x10484d(0x420031400000):0x42004c800680:0xf48] (Src 0x1, CPU50)
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bd10:[0x42003150484c]MCSLockSpin@vmkernel#nover+0x41 stack: 0x43022a304320
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bd10:[0x42003150484c]MCSLockSpin@vmkernel#nover+0x41 stack: 0x43022a304320
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bd10:[0x42003150484f]MCSLockSpin@vmkernel#nover+0x44 stack: 0x43022a304320
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bd40:[0x420031504eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bd40:[0x420031504eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bd40:[0x420031504eaa]MCSLockWait@vmkernel#nover+0x153 stack: 0x0
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bd60:[0x42003150538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0x3c4ef6a900000000
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bd60:[0x42003150538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0x3c4ef6a900000000
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bd60:[0x42003150538a]MCSLockIRQWork@vmkernel#nover+0x4f stack: 0x3c4ef6a900000000
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bd80:[0x4200315486d8]FastSlabCreateObj@vmkernel#nover+0x11d stack: 0x100000002
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bd80:[0x4200315486d8]FastSlabCreateObj@vmkernel#nover+0x11d stack: 0x100000002
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bd80:[0x4200315486d8]FastSlabCreateObj@vmkernel#nover+0x11d stack: 0x100000002
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9be10:[0x420031548d0a]FastSlabReplenishCPU@vmkernel#nover+0x83 stack: 0x1
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69be10:[0x420031548d0a]FastSlabReplenishCPU@vmkernel#nover+0x83 stack: 0x1
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61be10:[0x420031548d0a]FastSlabReplenishCPU@vmkernel#nover+0x83 stack: 0x1
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9be60:[0x42003154a0fb]FastSlabAllocSlow@vmkernel#nover+0x84 stack: 0xc
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69be60:[0x42003154a0fb]FastSlabAllocSlow@vmkernel#nover+0x84 stack: 0xc
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61be60:[0x42003154a0fb]FastSlabAllocSlow@vmkernel#nover+0x84 stack: 0xc
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9be80:[0x4200315efd82]Pkt_SlabAllocPkt@vmkernel#nover+0x133 stack: 0x1f
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bec0:[0x4200315ef291]Pkt_AllocHandleWithSize@vmkernel#nover+0xee stack: 0x45213cec4f20
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69be80:[0x4200315efd82]Pkt_SlabAllocPkt@vmkernel#nover+0x133 stack: 0x1f
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61be80:[0x4200315efd82]Pkt_SlabAllocPkt@vmkernel#nover+0x133 stack: 0x1f
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bee0:[0x4200315ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x279
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bec0:[0x4200315ef291]Pkt_AllocHandleWithSize@vmkernel#nover+0xee stack: 0x4521375f4d80
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bec0:[0x4200315ef291]Pkt_AllocHandleWithSize@vmkernel#nover+0xee stack: 0x4521375e6aa0
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bf00:[0x4200316a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0xeae00
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bee0:[0x4200315ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x26c
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bee0:[0x4200315ef580]Pkt_AllocWithFlags@vmkernel#nover+0xd stack: 0x355
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bf00:[0x4200316a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0xf4e00
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bf10:[0x420032445055]nmlx5_en_PostRxWqes@(nmlx5_core)#<None>+0xc2 stack: 0x431559ca3ae0
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bf00:[0x4200316a3ada]vmk_PktAlloc@vmkernel#nover+0x1f stack: 0xf4e00
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bf40:[0x420032444d4d]nmlx5_en_NetPollCB@(nmlx5_core)#<None>+0x5a stack: 0x1
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bf10:[0x420032445055]nmlx5_en_PostRxWqes@(nmlx5_core)#<None>+0xc2 stack: 0x431559a55910
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bf10:[0x420032445055]nmlx5_en_PostRxWqes@(nmlx5_core)#<None>+0xc2 stack: 0x431559a55340
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bf70:[0x4200316a622f]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x36
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bf40:[0x420032444d4d]nmlx5_en_NetPollCB@(nmlx5_core)#<None>+0x5a stack: 0x1
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bf40:[0x420032444d4d]nmlx5_en_NetPollCB@(nmlx5_core)#<None>+0x5a stack: 0x1
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9bfe0:[0x4200317b290d]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bf70:[0x4200316a622f]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x36
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bf70:[0x4200316a622f]NetPollWorldCallback@vmkernel#nover+0x190 stack: 0x36
2024-10-10T16:22:40.230Z cpu50:2098347)0x453961c9c000:[0x4200314c4b8f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61bfe0:[0x4200317b290d]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69bfe0:[0x4200317b290d]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
2024-10-10T16:22:40.230Z cpu58:2098204)0x45395d61c000:[0x4200314c4b8f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2024-10-10T16:22:40.230Z cpu53:2098205)0x45395d69c000:[0x4200314c4b8f]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2024-10-10T16:22:41.230Z cpu46:197905059)WARNING: Heartbeat: 827: PCPU 66 didn't have a heartbeat for 7 seconds, timeout is 14, 1 IPIs sent; *may* be locked up.
2024-10-10T16:22:41.230Z cpu66:2098348)ALERT: NMI: 710: NMI IPI: RIPOFF(base):RBP:CS [0x105381(0x420031400000):0x8:0xf48] (Src 0x1, CPU66)
2024-10-10T16:22:41.230Z cpu66:2098348)0x453961d1b720:[0x420031505380]MCSLockIRQWork@vmkernel#nover+0x45 stack: 0x3
Experience one or both of the following:
Network disconnections. This issue has primarily been found interrupting storage network connections such as NFS and vSAN, but it may impact any network traffic.
- This can be seen in vobd.log with messages similar to:
2024-10-10T16:19:08.243Z: [netCorrelator] 61858277913312us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic5 is down. Affected dvPort: 572/50 ## ## ## ## ## ## ##-## ## ## ## ## ## ## ##. 1 uplinks up. Failed criteria: 128
PSOD with the same backtrace as seen in the vmkernel.log messages above.
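As a rough aid for comparing a host's own logs with the samples above, the following minimal Python sketch scans vmkernel.log lines for PCPUs that both missed a heartbeat and logged a backtrace containing the fastslab packet-allocation frames. The function name, the regular expressions, and the choice of which frames count as the signature are assumptions made for illustration; this is not an official detection method, and a match does not by itself confirm the issue.

```python
import re
from collections import defaultdict

# Patterns modelled on the sample vmkernel.log lines in this article.
HEARTBEAT_RE = re.compile(
    r"Heartbeat: \d+: PCPU (\d+) didn't have a heartbeat for (\d+) seconds"
)
FRAME_RE = re.compile(r"cpu(\d+):\d+\)0x[0-9a-f]+:\[0x[0-9a-f]+\](\w+)@")

# Assumption: these three frames together are treated as the signature.
SIGNATURE_FRAMES = {"MCSLockSpin", "FastSlabAllocSlow", "Pkt_SlabAllocPkt"}


def scan_vmkernel(lines):
    """Return the sorted PCPU numbers that missed a heartbeat and whose
    backtrace contains every frame in SIGNATURE_FRAMES."""
    stalled = set()              # PCPUs named in Heartbeat warnings
    frames = defaultdict(set)    # PCPU -> function names seen in its backtrace
    for line in lines:
        hb = HEARTBEAT_RE.search(line)
        if hb:
            stalled.add(int(hb.group(1)))
            continue
        fr = FRAME_RE.search(line)
        if fr:
            frames[int(fr.group(1))].add(fr.group(2))
    return sorted(cpu for cpu in stalled if SIGNATURE_FRAMES <= frames[cpu])
```

A typical use would be to pass the lines of a vmkernel.log file taken from a support bundle, for example `scan_vmkernel(open("vmkernel.log", errors="replace"))`, and review any PCPUs it reports by hand.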
Environment:
ESXi build below 7.0 P04/U3g (20328353)
Cause:
The fastslab used by the packet slab does not pre-reserve memory. When memory pressure is present and the fastslab cannot obtain the memory it is required to provide, a situation develops where a CPU hangs with a memory overallocation.
Resolution:
There is no workaround for this issue, and there is no method to reliably monitor the risk of encountering it at any specific point in time.
The resolution is to update ESXi to 7.0 P04/U3g (20328353) or later.
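The build comparison above can be sketched in Python as follows. This is a minimal illustration with two assumptions: the input is a version line of the form `VMware ESXi 7.0.3 build-19193900` (the shape printed by `vmware -v`), and the fixed build number is meaningful only for the 7.0 release line, so other release lines are reported as undetermined rather than compared. The function name and return convention are hypothetical.

```python
import re

# Fixed build named in this article: ESXi 7.0 P04/U3g.
FIXED_BUILD = 20328353

# Assumed input shape: "VMware ESXi <major>.<minor>.<update> build-<number>".
VERSION_RE = re.compile(r"ESXi (\d+)\.(\d+)\.\d+ build-(\d+)")


def is_affected(version_line):
    """Return True if a 7.0 host is below the fixed build, False if it is
    at or above it, and None for release lines this sketch does not cover."""
    match = VERSION_RE.search(version_line)
    if match is None:
        raise ValueError(f"unrecognised version line: {version_line!r}")
    major, minor, build = (int(part) for part in match.groups())
    if (major, minor) != (7, 0):
        # Assumption: build numbers are only compared within the 7.0 line.
        return None
    return build < FIXED_BUILD
```

Returning `None` for other release lines is a deliberate choice: build numbers from different release lines interleave over time, so a plain numeric comparison across them could give a misleading answer.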