Rolling PSOD issue observed in vSAN ESA 8.0U2 cluster

Article ID: 380671

Updated On:

Products

VMware vSAN

Issue/Introduction

A host in a vSAN ESA cluster fails with a purple diagnostic screen (PSOD) showing the following backtrace.

Backtrace:
VMware ESXi 8.0.2 [Releasebuild-22380479 x86_64] VERIFY bora/vsan/zdo/libzdom/zdomSegCache.c:454

Module(s) involved in panic: [vsan Version (19).0, Built on: Sep 4 2023]
cr0=0x80010031 cr2=0x5a5480 cr3=0x20e000 cr4=0x142768
FMS=86/8f/8 uCode=0x2b000461
MPCPU4:2098869/zDOM-com p1/t0
PCPU 0: SSSSSSSSSSSSSSSSSUSUSSSSSSSSSSSSSSSSUSSSSUSSSSUSUSSSUSSSUSUUSSSU
Code start: 0x420013600000 VMK uptime: 0:00:03:03.007
0x45392d09b400:[0x420013719b5a]PanicvPanicInt@vmkernel#nover+0x202 stack: 0x420015798e80
0x45392d09b4b0:[0x42001371a1f8]Panic_NoSave@vmkernel#nover+0x4d stack: 0x45392d09b510
0x45392d09b510:[0x42001371a785]Panic_OnAssertAt@vmkernel#nover+0xba stack: 0x1c600008000
0x45392d09b590:[0x42001376e4cf]Int6_UD2Assert@vmkernel#nover+0x260 stack: 0x0
0x45392d09b5c0:[0x4200137670b6]gate_entry@vmkernel#nover+0xa7 stack: 0x0
0x45392d09b680:[0x420015738387]ZDOMSegCache_GetOne@com.vmware.vsan#0.0.0.1+0x1ff stack: 0x45be61ded028
0x45392d09b6c0:[0x42001572f9f8]ZDOMObj_SegmentAlloc@com.vmware.vsan#0.0.0.1+0x2a1 stack: 0x2
0x45392d09b728:[0x42001573239e]ZDOMObj_SegmentAllocWithRetry@com.vmware.vsan#0.0.0.1+0x113 stack: 0x8
0x45392d09b780:[0x42001575d7eb]ZDOMVat_HandleSegments@com.vmware.vsan#0.0.0.1+0xa4 stack: 0x45392d09b950
0x45392d09b838:[0x42001579d467]ZDOM_ApplyStateMachine@com.vmware.vsan#0.0.0.1+0x4c8 stack: 0x7c
0x45392d09ba10:[0x4200156d5ba5][email protected]#0.0.0.1+0x276 stack: 0x2d30312d34323032
0x45392d09baf0:[0x4200156dbe38][email protected]#0.0.0.1+0x175 stack: 0x432404cc4a28
0x45392d09bb50:[0x4200156da88b][email protected]#0.0.0.1+0x84 stack: 0x45be61ded988
0x45392d09bbc0:[0x4200157384ae]ZDOMSegCache_RefillAll@com.vmware.vsan#0.0.0.1+0x2f stack: 0x4324055f4a88
0x45392d09bc10:[0x4200157383ba]ZDOMObj_BootstrapWorkRegObj@com.vmware.vsan#0.0.0.1+0x483 stack: 0x45bdaf425fd8
0x45392d09bc60:[0x420015786e71]ZDOMObj_BootstrapRegObj@com.vmware.vsan#0.0.0.1+0x2c6 stack: 0x8
0x45392d09bcf0:[0x42001572b12f]ZDOMLibPoolHandler@com.vmware.vsan#0.0.0.1+0x2cc8 stack: 0x0
0x45392d09be90:[0x4200156e3409]ZDOMLibPreWorldHandler@com.vmware.vsan#0.0.0.1+0x146 stack: 0x1
0x45392d09bfa8:[0x42001373a234]vmkWorldFunc@vmkernel#nover+0x31 stack: 0x42001373a238
0x45392d09bfe0:[0x420013a2c015]CpuSched_StartWorld@vmkernel#nover+0xe2 stack: 0x8
0x45392d09c000:[0x4200136dbdff]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x8
base fs=0x8 gs=0x420041888888 Kgs=0x8
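
When several hosts have failed, it can help to check each captured panic text against this signature mechanically. The following is a minimal sketch, not a VMware tool: it assumes the panic text has already been saved to a string, and the function name `matches_psod_signature` is hypothetical. It looks only for the two markers shown in the backtrace above, the VERIFY location `zdomSegCache.c:454` and the `ZDOMSegCache_GetOne` frame.

```python
import re

# Both patterns come from the backtrace in this article.
# The VERIFY line names the source file and line of the failed assertion.
ASSERT_RE = re.compile(r"VERIFY\s+\S*zdomSegCache\.c:454")
# The first vSAN frame below the panic handlers in the backtrace.
FRAME_RE = re.compile(r"ZDOMSegCache_GetOne@")


def matches_psod_signature(panic_text: str) -> bool:
    """Return True only if both markers of this PSOD appear in the text."""
    has_assert = ASSERT_RE.search(panic_text) is not None
    has_frame = FRAME_RE.search(panic_text) is not None
    return has_assert and has_frame
```

A match only suggests that a host hit the same assertion. It does not replace log review by vSAN Support.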


When the affected host is rebooted, another host in the cluster fails with a PSOD, leaving the cluster unstable.

From vmkernel-zdump.log:
2024-10-23T00:12:09.539Z cpu44:2098853)DOM: DOMOwnerGetEncrCtxFromExtAttr:3850: xxxxxxxx-8094-c588-d407-xxxxxxxx: Fetched encrCtx from extAttr with encrActiveKey:0, encrEnabled:0, encrCompliant:1, encrGenNum:0, encrPersisted:1, encrPerObjKey: 1
2024-10-23T00:12:09.539Z cpu44:2098853)DOM: DOMOwnerAddEncrCtxToExtAttr:3936: xxxxxxxx-8094-c588-d407-xxxxxxxxx: Set encryption context in extAttr: encrActiveKey:0, encrEnabled:0, encrCompliant:1, encrGenNum:0, encrPersisted:1, encrPerObjKey: 1
2024-10-23T00:12:09.541Z cpu57:2098926)ZDOMObj_GetEncrKeys:7724: xxxxxxxx-8094-c588-d407-xxxxxxxx: encryption=0(for userlevel), encrKeys[0].keyIdx=0, encrKeys[0].valid=0, encrKeys[1].keyIdx=1, encrKeys[1].valid=0, encodedKeys=0x10000
2024-10-23T00:12:09.541Z cpu57:2098926)VtxTxMgrAlloc:122:xxxxxxxx-8094-c588-d407-xxxxxxxx: Opening transaction system
2024-10-23T00:12:09.541Z cpu57:2098926)ZDOMLib_CreateExecCtx:1679: Created zDOM world 'vtxWB-hot-reg' with ID: 154627192

Environment

VMware vSAN ESA 8.0U2

Cause

The vSAN host encountered a race condition in which a storage I/O command containing vSAN object metadata was sent to a data structure that vSAN had not yet created. This led to an unhandled exception, which resulted in a purple diagnostic screen (PSOD).

Resolution

This issue is fixed in 8.0U3 and later. Upgrade vCenter and ESXi to 8.0U3 or later.
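
To check many hosts against the fixed release, the version string can be compared programmatically. This is a sketch under stated assumptions: the input is a line containing `ESXi <major>.<minor>.<update>`, such as the `VMware ESXi 8.0.2` banner in the backtrace above, and 8.0U3 is assumed to report itself as `8.0.3`. The function name `has_fix` is hypothetical.

```python
import re

# Assumption: 8.0U3 reports itself as version 8.0.3.
FIXED_VERSION = (8, 0, 3)

VERSION_RE = re.compile(r"ESXi\s+(\d+)\.(\d+)\.(\d+)")


def has_fix(version_line: str) -> bool:
    """Return True if the ESXi version in the line is 8.0.3 or later."""
    match = VERSION_RE.search(version_line)
    if match is None:
        raise ValueError(f"no ESXi version found in: {version_line!r}")
    version = tuple(int(part) for part in match.groups())
    # Tuples compare element by element: major, then minor, then update.
    return version >= FIXED_VERSION
```

For the banner in this article, `has_fix("VMware ESXi 8.0.2 [Releasebuild-22380479 x86_64]")` returns `False`. A `False` result means only that the host predates the fix. The article confirms the issue on 8.0U2 and does not say whether earlier releases are affected.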

If this issue has already been encountered, reboot the host(s) that experienced the PSOD, collect the logs, and open a case with vSAN Support. Upload the logs to the case so that Support can help identify the object(s) causing the PSOD and provide remediation steps to stabilize the cluster.