600+ VMs hosted on a single backend PSA device will cause ESXi to PSOD
Article ID: 375673

Updated On:

Products

VMware vSphere ESX 7.x
VMware vSphere ESX 8.x

Issue/Introduction

  • ESXi is crashing with a purple screen (PSOD) similar to: 

Panic Details: Crash at yyyy-mm-ddThh:mm:ss.msZ on CPU 87 running world 2099126 - HBReclaimHelperQueue. VMK Uptime:16:17:37:59.102
Panic Message: @BlueScreen: NMI IPI: Panic requested by another PCPU. RIPOFF(base):RBP:CS [0x4ac9a0(0x420022200000):0x3aae0c:0xf48] (Src 0x1, CPU87)
Backtrace:
  0x452a40572cf0:[0x4200222ff107]PanicvPanicInt@vmkernel#nover+0x327 stack: 0x452a40572dc8, 0x43043900bdd8, 0x4200222ff107, 0x4200227f4600, 0x452a40572cf0
  0x452a40572dc0:[0x4200222ff6b9]Panic_WithBacktrace@vmkernel#nover+0x56 stack: 0x452a40572e30, 0x452a40572de0, 0x452a40572e40, 0x452a40572df0, 0x4ac9a0
  0x452a40572e30:[0x4200222fbe0c]NMI_Interrupt@vmkernel#nover+0x561 stack: 0xa13b03196ebaa2ab, 0xf48, 0x260ee94b2a494521, 0xa90651a59302155a, 0x19eff3f43bcc9868
  0x452a40572f00:[0x420022353392]IDTNMIWork@vmkernel#nover+0x7f stack: 0x420055c00000, 0x4200223546dd, 0x28dff324657a4bbc, 0x452a40572fd0, 0x0
  0x452a40572f20:[0x4200223546dc]Int2_NMI@vmkernel#nover+0x19 stack: 0x0, 0x42002234e068, 0xf50, 0xf50, 0x0
  0x452a40572f40:[0x42002234e067]gate_entry@vmkernel#nover+0x68 stack: 0x0, 0x0, 0x0, 0x430f1e9e6480, 0x430ab7665af0
  0x453aba01bc20:[0x4200226ac9a0]vmk_ScsiInitTaskMgmt@vmkernel#nover+0x20 stack: 0x4ab, 0x430ab601da80, 0xccc644cf, 0x4200226343ec, 0x453aba01bccc
  0x453aba01bc50:[0x4200226343eb]SCSIDeviceVirtResetCommon@vmkernel#nover+0x170 stack: 0x430ab6120ce8, 0x200001388, 0x32, 0x430ab6120c40, 0x430ab601da80
  0x453aba01bcb0:[0x420022634b38]SCSI_DeviceVirtReset@vmkernel#nover+0x2c1 stack: 0x453a92d9f140, 0x632c225ae4bf, 0x430ab7663450, 0x4ab, 0x337a203594e44
  0x453aba01bd20:[0x420022620d8a]PsaScsi_DevIoctl@vmkernel#nover+0x7e7 stack: 0x2, 0x293c6, 0x0, 0x4200225be20d, 0x337a2035a7f4a
  0x453aba01bdf0:[0x420023388859][email protected]#nover+0x81e stack: 0x7e40, 0x0, 0x0, 0x0, 0x42004c000000
  0x453aba01bf10:[0x4200233d7b9f]FS3ReclaimHBCB@esx#nover+0x1a0 stack: 0xe593feb35c99c, 0x430369801220, 0x43199b9dfc20, 0x4200222da400, 0x0
  0x453aba01bf40:[0x4200222da3ff]HelperQueueFunc@vmkernel#nover+0x2d8 stack: 0x453aba020b48, 0x431998255d38, 0x453aba01f000, 0x0, 0x431998255dc0
  0x453aba01bfe0:[0x4200225b4d55]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0, 0x4200222c4de0, 0x0, 0x0, 0x0
  0x453aba01c000:[0x4200222c4ddf]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
Saved backtrace from: pcpu 87 Heartbeat NMI
  0x453aba01bc20:[0x4200226ac99f]vmk_ScsiInitTaskMgmt@vmkernel#nover+0x20 stack: 0x4ab
  0x453aba01bc50:[0x4200226343eb]SCSIDeviceVirtResetCommon@vmkernel#nover+0x170 stack: 0x430ab6120ce8
  0x453aba01bcb0:[0x420022634b38]SCSI_DeviceVirtReset@vmkernel#nover+0x2c1 stack: 0x453a92d9f140
  0x453aba01bd20:[0x420022620d8a]PsaScsi_DevIoctl@vmkernel#nover+0x7e7 stack: 0x2
  0x453aba01bdf0:[0x420023388859][email protected]#nover+0x81e stack: 0x7e40
  0x453aba01bf10:[0x4200233d7b9f]FS3ReclaimHBCB@esx#nover+0x1a0 stack: 0xe593feb35c99c
  0x453aba01bf40:[0x4200222da3ff]HelperQueueFunc@vmkernel#nover+0x2d8 stack: 0x453aba020b48
  0x453aba01bfe0:[0x4200225b4d55]CpuSched_StartWorld@vmkernel#nover+0x86 stack: 0x0
  0x453aba01c000:[0x4200222c4ddf]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0

  • In the ESXi host log /var/run/log/vmkernel.log, you will see entries similar to the following:

yyyy-mm-ddThh:mm:ss.msZ cpu41:2098249)ScsiDeviceIO: 7136: Waiting for completion for all issued commands for partition naa.xyz:1. Already waited 5 secs. 4 completions still awaited.
yyyy-mm-ddThh:mm:ss.msZ cpu10:2126869)WARNING: Heartbeat: 827: PCPU 41 didn't have a heartbeat for 49 seconds, timeout is 14, 3 IPIs sent; may be locked up.
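
If you suspect a host is affected, you can check the log for both signatures at once. A minimal sketch (the helper name is illustrative, not a VMware tool; on ESXi the live log is /var/run/log/vmkernel.log):

```shell
# Illustrative helper: scan a vmkernel log for the two precursor
# messages shown above (stalled command drain + missing PCPU heartbeat).
scan_vmk_log() {
    grep -E "Waiting for completion for all issued commands|didn't have a heartbeat for" "$1"
}

# On the ESXi host:
# scan_vmk_log /var/run/log/vmkernel.log
```

Any output from the scan means the host has logged at least one of the precursor messages and warrants closer inspection.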

Cause

The PCPU enters a loop in SCSIDeviceDrainOutstandingCmds, and the PCPU on which this thread runs does not issue a heartbeat for several seconds; the heartbeat watchdog then sends NMI IPIs to the stuck PCPU, producing the PSOD shown above.
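
Because the trigger is a large number of VMs issuing I/O through a single backend PSA device, you can gauge a host's exposure by listing the worlds that hold each device open. A hedged sketch using the standard esxcli storage namespace (the helper name is illustrative, and naa.xyz is a placeholder device identifier; run on the ESXi host):

```shell
# Illustrative helper: list the worlds (I/O consumers) that currently
# have the given PSA device open. A very long list on one device
# indicates exposure to this issue.
list_device_worlds() {
    esxcli storage core device world list -d "$1"
}

# On the ESXi host:
# list_device_worlds naa.xyz
```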

Resolution

This is a known issue in VMware ESXi 7.0.x; there is currently no resolution for that release.

This issue is fixed in VMware ESXi 8.0 Update 3. To download the update, go to Download Broadcom Products and Software.