Multiple PSODs on consecutive PCPUs for the same physical processor core #PF Exception 14

Article ID: 422269


Products

VMware vSphere ESXi

Issue/Introduction

  • The host crashes with a purple diagnostic screen (PSOD) showing the following backtrace ("PCPU": 0x12345678):

Code start: 0x420028400000 VMK
0x45399f19b2c0:[0x4200287bfb99]CpuSchedMigrateGoodness@vmkernel#nover+0xa3d stack: 0x45399f19b4c8
0x45399f19b360:[0x4200287c169c]CpuSched_VcpuMigrateBestPcpu@vmkernel#nover+0x4ad stack: 0xad452d77044c
0x45399f19b6d8:[0x4200287af8fe]CpuSchedVcpuMakeReady@vmkernel#nover+0xdf stack: 0x453996d1f980
0x45399f19b6f8:[0x4200287af9d2]CpuSchedWakeupWorld@vmkernel#nover+0x93 stack: 0xad4588888865
0x45399f19b740:[0x4200287afdb1]CpuSchedWakeupWorldInt@vmkernel#nover+0x18a stack: 0x45394cf9f100
0x45399f19b870:[0x4200287affef]CpuSchedSleepTimeout@vmkernel#nover+0x24 stack: 0x0
0x45399f19b890:[0x42002850f56f]TimerWheelHandler@vmkernel#nover+0x88 stack: 0x420044405da0
0x45399f19b918:[0x42002850f7b3]Timer_BHHandler@vmkernel#nover+0x48 stack: 0x420844400800
0x45399f19b940:[0x4200284c044e]BH_DrainAndDisableInterrupts@vmkernel#nover+0x97 stack: 0x453956c9f940
0x45399f19b9c0:[0x4200284dfeaa]IntrCookie_VmkernelInterrupt@vmkernel#nover+0xb3 stack: 0xffffffffffffffef
0x45399f19b9e0:[0x420028555aac]IDT_IntrHandler@vmkernel#nover+0x9d stack: 0x8
0x45399f19ba00:[0x42002854e067]gate_entry@vmkernel#nover+0x68 stack: 0x0
0x45399f19bac8:[0x420028484814]Power_ArchPerformHalt@vmkernel#nover+0x78 stack: 0x420844408988
0x45399f19bad0:[0x420028484982]Power_ArchSetCState@vmkernel#nover+0x8f stack: 0x0
0x45399f19bb28:[0x4200287af368]CpuSchedIdleLoopInt@vmkernel#nover+0x275 stack: 0x420044408108
0x45399f19bb98:[0x4200287b342e]CpuSchedDispatch@vmkernel#nover+0x1aff stack: 0x428844480140
0x45399f19bdd0:[0x4200287b4183]CpuSchedHalt@vmkernel#nover+0x2f4 stack: 0x7
0x45399f19bf50:[0x4200287b471a]CpuSched_VcpuHalt@vmkernel#nover+0x13f stack: 0x45399f19f000
0x45399f19bfa0:[0x42002852d4fb]VMMVMKCall_Call@vmkernel#nover+0x108 stack: 0x0
0x45399f19bfe0:[0x420028559489]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x42002855947c
base fs=0x0 gs=0x428844400008 Kgs=0x8
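Each frame above follows the pattern [address]function@module#build+offset. The absolute addresses are not fixed (the CPU #17 log below places vmkernel code around 0x42002c... rather than 0x420028...), so function plus offset is the more convenient key when comparing crash sites across PSODs. As a reading aid only, the Python sketch below splits one frame into its parts. The names FRAME_RE, describe_frame and CODE_START are hypothetical, and the regex assumes frames in the cleaned-up 0x form shown above.

```python
import re

# Hypothetical pattern for one frame such as
# "[0x4200287bfb99]CpuSchedMigrateGoodness@vmkernel#nover+0xa3d"
FRAME_RE = re.compile(
    r"\[(0x[0-9a-fA-F]+)\](\w+)@(\w+)#\w+\+(0x[0-9a-fA-F]+)"
)

# "Code start" value printed above the PSOD backtrace
CODE_START = 0x420028400000


def describe_frame(line: str) -> dict:
    """Decode one backtrace frame into function, offset and derived addresses."""
    m = FRAME_RE.search(line)
    if m is None:
        raise ValueError(f"not a backtrace frame: {line!r}")
    addr_hex, function, module, offset_hex = m.groups()
    addr = int(addr_hex, 16)
    offset = int(offset_hex, 16)
    return {
        "function": function,
        "module": module,
        "offset": offset,
        # Frame address relative to the start of vmkernel code
        "rel_addr": addr - CODE_START,
        # Symbol address of the function: frame address minus "+offset"
        "func_start": addr - offset,
    }


frame = "0x45399f19b2c0:[0x4200287bfb99]CpuSchedMigrateGoodness@vmkernel#nover+0xa3d"
info = describe_frame(frame)
print(info["function"], hex(info["offset"]), hex(info["rel_addr"]))
# CpuSchedMigrateGoodness 0xa3d 0x3bfb99
```

Applied to the first frame, this confirms that both the PSOD screen and the CPU #16 log entry fault in CpuSchedMigrateGoodness at offset 0xa3d.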

 

  • Backtraces for consecutive PCPUs (CPU #16 and CPU #17) are recorded in /var/run/log/LogEFI.log:

YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)Backtrace for current CPU #16, worldID=2107401, fp=0x45399ce1b350
YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)0x45399ce1b2c0:[0X1234fbfb99]CpuSchedMigrateGoodness@vmkernel#nover+0xa3d stack: 0x3ffffff, 0x4302XXXXXXXX, 0x42XXXXXXXX00, 0x45399ce1b4fc, 0x0
YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)0x45399ce1b360:[0X1234fc169c]CpuSched_VcpuMigrateBestPcpu@vmkernel#nover+0x4ad stack: 0x5aXXXXX, 0x58XXXXXXXX, 0x4XXXXXX, 0x0, 0xfffffffffe

[0X1234fa8db2]CpuSchedHaltMonTimerCB@vmkernel#nover+0x8b stack: 0X1234fa8d28, 0X1XXXXXX70, 0x420XXXXXX0, 0x4200XXX, 0x4XXXXXX0
YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)0x45399ce1b890:[0X1234d0f56f]TimerWheelHandler@vmkernel#nover+0x88 stack: 0x420044XXXX, 0x5aca9bcXXXX0, 0x1, 0X1234d08b22, 0x45399ce1fXXXX
YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)0x45399ce1b910:[0X1234d0f7b3]Timer_BHHandler@vmkernel#nover+0x48 stack: 0x420044XXXX00, 0xef, 0x0, 0X1234cc044f, 0x42XXXXXXXXef

YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)Backtrace for current CPU #17, worldID=2110682, fp=0x45395759b9f0
YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)0x45395759b960:[0x42002c7bfadb]CpuSchedMigrateGoodness@vmkernel#nover+0x97f stack: 0x1004dfbec, 0x0, 0x5, 0x45395759bb9c, 0x45395759bb88
YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)0x45395759ba00:[0x42002c8b7fef]User_VMKCallSemaSignal@vmkernel#nover+0x18 stack: 0x453991d9f140, 0x4200XXXX2d4fc, 0x45395759fXXXX, 0xfffffffffc406a08, 0x0
YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)0x45395759bfa0:[0x4200XXXX2d4fb]VMMVMKCall_Call@vmkernel#nover+0x108 stack: 0x0, 0x0, 0x400, 0x292, 0x78ddc1XXXX
YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)0x45395759bfe0:[0x4200XXXX9489]VMKVMM_ArchEnterVMKernel@vmkernel#nover+0xe stack: 0x4200XXXX947c, 0xfffffffffc06b29a, 0x0, 0x0, 0x0
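The two backtraces come from CPU #16 and CPU #17. On a hyperthreaded host where the logical threads of a core are numbered consecutively, such a pair maps to one physical core, which is what ties these PSODs to a single processor core. The Python sketch below groups crashing PCPUs that way. It assumes two threads per core and consecutive sibling numbering; the function name crashed_pcpus_by_core is hypothetical, and the actual core of each PCPU should be confirmed on the host (for example with `esxcli hardware cpu list`).

```python
import re
from collections import defaultdict

# Matches the header line written before each per-CPU backtrace
BACKTRACE_RE = re.compile(r"Backtrace for current CPU #(\d+)")


def crashed_pcpus_by_core(log_lines, threads_per_core=2):
    """Group PCPUs that logged a backtrace by their (assumed) physical core.

    Assumes hyperthreading with `threads_per_core` logical CPUs per core and
    sibling threads numbered consecutively, so PCPU n sits on core
    n // threads_per_core.
    """
    cores = defaultdict(set)
    for line in log_lines:
        m = BACKTRACE_RE.search(line)
        if m:
            pcpu = int(m.group(1))
            cores[pcpu // threads_per_core].add(pcpu)
    return {core: sorted(pcpus) for core, pcpus in sorted(cores.items())}


sample = [
    "YYYY-MM-DDTHH:MM:SS.765Z cpu16:2107401)Backtrace for current CPU #16, worldID=2107401, fp=0x45399ce1b350",
    "YYYY-MM-DDTHH:MM:SS.406Z cpu17:2110682)Backtrace for current CPU #17, worldID=2110682, fp=0x45395759b9f0",
]
print(crashed_pcpus_by_core(sample))  # {8: [16, 17]}
```

A single core index holding both PCPUs is the pattern described in the title; crashes scattered across many unrelated cores would point elsewhere.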

Environment

  • ESXi 7.x  

Cause

The PSODs occur because the processor loaded an incorrect value of gc->period[0]: it read from an address that differs from the expected address by a single flipped bit. This indicates a fault in the physical CPU.
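A "one bit flip" means the expected and the accessed address differ in exactly one bit position, so their XOR is a nonzero power of two. The Python sketch below checks that condition and reports which bit flipped. The addresses used are hypothetical, since the article does not publish the real addresses of gc->period[0], and the function name flipped_bit is illustrative only.

```python
from typing import Optional


def flipped_bit(expected: int, observed: int) -> Optional[int]:
    """Return the index of the single bit in which two addresses differ.

    Returns None if the addresses are identical or differ in more than one bit.
    """
    diff = expected ^ observed
    # Exactly one bit set <=> diff is a nonzero power of two
    if diff == 0 or diff & (diff - 1):
        return None
    return diff.bit_length() - 1


# Hypothetical addresses; not taken from the crash dumps above.
expected_addr = 0x430012345678
observed_addr = 0x430012245678
print(flipped_bit(expected_addr, observed_addr))  # 20
```

A clean single-bit difference like this is hard to explain as a software bug, which is why the resolution focuses on the processor.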

Resolution

Engage the hardware vendor to check and remediate the affected physical CPU.

Additional Information

 https://knowledge.broadcom.com/external/article/402465/