In rare cases, running extremely high-IO benchmark tests on vSAN ESA can cause vSAN hosts to enter a non-responding state in vCenter and may produce network isolation symptoms between member ESXi hosts in the ESA cluster.
In the vmkernel logs, you may see messages like the following:
ZDOMObj_BootstrapPrepare:839: <UUID>: Failed to create transaction manager: Out of memory
VtxWriteBackHandler:100: <UUID>: Writeback worker [0x4338e294dba8][112337576](hot)(vat) in [0x4338e22e10f8] started
ZDOMObj_BootstrapPrepare:892: <UUID>: Bootstrap prepare failed: Out of memory
ZDOMObj_Exit:4014: <UUID>: Exit
VtxWriteBackHandler:100: <UUID>: Writeback worker [0x4338e2716988][112337577](cold)(vat) in [0x4338e22e10f8] started
ZDOMMiddleMapKeyIdxFixerStartQuiesce:8881: <UUID>: middleMapFixer quiesce started.
ZDOMMoveStopDomLLPComplianceWorker:2104: <UUID>: DOM LLP compliance worker quiesce started.
VtxWriteBackHandler:162: <UUID>: Unexpected errors happen when write back worker waits for signal or time out: World is marked for death
ZDOM_FinalizeReturnStatus:3380: <UUID>: World is marked for death
ZDOM_FinalizeReturnStatus:3384: <UUID>: Final status is not OK: World is marked for death
ZDOMObjHandleFSPUpdate:5407: <UUID>: reg-obj: update=0 useWorker=1 cleanupObj=0: World is marked for death
VtxWriteBackHandler:371: <UUID>: Writeback worker [0x431492003ba8][112337568](cold)(vat) in [0x431492004018] terminated
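To check whether a host has logged the out-of-memory bootstrap signatures shown above, a sketch along the following lines could be used. It assumes the live log is at /var/run/log/vmkernel.log, which is the usual location on ESXi but may differ if logging has been redirected. The function name count_bootstrap_oom is illustrative only and is not a VMware tool.

```shell
#!/bin/sh
# Count vmkernel log lines matching the two "Out of memory" bootstrap
# signatures quoted in this article. Illustrative helper, not a VMware tool.
count_bootstrap_oom() {
    logfile="$1"
    # -c prints the number of matching lines; -E enables the (a|b) alternation.
    grep -c -E 'ZDOMObj_BootstrapPrepare:[0-9]+: .*(Failed to create transaction manager|Bootstrap prepare failed): Out of memory' "$logfile"
}

# Example (assumed default log location on an ESXi host):
#   count_bootstrap_oom /var/run/log/vmkernel.log
```

A non-zero count only indicates that the signature is present; it does not by itself confirm that the host hit this condition.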
Additionally:
8.0+ vSAN ESA
A vSAN memory race condition causes the bootstrap failure.
If you have hit this condition, or are planning benchmark testing on vSAN ESA, follow the instructions below to avoid this cluster instability.
Step 1:
Upgrade the ESXi hosts to 8.0.3 Patch 04. Apply the setting below only after the cluster has been upgraded to 8.0.3 Patch 04.
Step 2:
Configure the following setting on every existing ESXi host/node in the cluster:
Disable PerOpTrace from the command line:
esxcfg-advcfg -s 0 /VSAN/DomUsePerOpTraceBuffer
Verify the setting from the command line:
esxcfg-advcfg -g /VSAN/DomUsePerOpTraceBuffer
The expected value is 0.
If the setting has not been applied, the command reports a value of 1.
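The verification step can also be scripted. The sketch below assumes that esxcfg-advcfg -g prints a line of the form "Value of DomUsePerOpTraceBuffer is 0"; if your build prints something different, adjust the parsing. The helper names advcfg_value and perop_trace_disabled are hypothetical.

```shell
#!/bin/sh
# Extract the value from a line such as
#   "Value of DomUsePerOpTraceBuffer is 0"
# (assumed output format of `esxcfg-advcfg -g`).
advcfg_value() {
    # Keep only the last whitespace-separated field of the line.
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Succeeds (exit status 0) only when the reported value is 0,
# meaning PerOpTrace is disabled.
perop_trace_disabled() {
    [ "$(advcfg_value "$1")" = "0" ]
}

# Example on an ESXi host:
#   out=$(esxcfg-advcfg -g /VSAN/DomUsePerOpTraceBuffer)
#   perop_trace_disabled "$out" && echo "PerOpTrace disabled" || echo "PerOpTrace still enabled"
```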
Note:
- This setting does not need a reboot to take effect.
- Once configured, the setting is persistent.
- This setting can be applied on both OSA and ESA clusters.
For consistency, customers may revert this advanced setting to 1 after upgrading to a fixed version.
Command:
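Because the setting must be applied on every host, a small loop may help on larger clusters. The following is only a sketch: it assumes SSH is enabled on the hosts and that root login is permitted, and the function name set_perop_trace and the host names in the example are hypothetical. It defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Print (dry run, the default) or run the setting change on each listed host.
# Assumes SSH is enabled on the hosts and root login is permitted.
set_perop_trace() {
    value="$1"
    shift
    for host in "$@"; do
        cmd="esxcfg-advcfg -s $value /VSAN/DomUsePerOpTraceBuffer"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show what would be executed without touching the host.
            echo "ssh root@$host $cmd"
        else
            ssh "root@$host" "$cmd"
        fi
    done
}

# Example with hypothetical host names (nothing is changed unless DRY_RUN=0):
#   set_perop_trace 0 esx01.example.com esx02.example.com
```

Defaulting to a dry run is a deliberate choice so that the host list can be reviewed before any change is made.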
esxcfg-advcfg -s 1 /VSAN/DomUsePerOpTraceBuffer
If you have questions or concerns about this condition, please open a support case with Broadcom/VMware.