vSAN ESA benchmark testing could cause cluster instability

Article ID: 383438

Products

VMware vSAN

Issue/Introduction

In rare cases, running extremely high I/O benchmark tests on vSAN ESA can cause vSAN hosts to enter a non-responding state in vCenter and possibly exhibit network isolation symptoms between member ESXi hosts in the ESA cluster.

In the vmkernel logs, you may see messages like the following:

ZDOMObj_BootstrapPrepare:839: <UUID>: Failed to create transaction manager: Out of memory
VtxWriteBackHandler:100: <UUID>: Writeback worker [0x4338e294dba8][112337576](hot)(vat) in [0x4338e22e10f8] started
ZDOMObj_BootstrapPrepare:892: <UUID>: Bootstrap prepare failed: Out of memory
ZDOMObj_Exit:4014: <UUID>: Exit
VtxWriteBackHandler:100: <UUID>: Writeback worker [0x4338e2716988][112337577](cold)(vat) in [0x4338e22e10f8] started
ZDOMMiddleMapKeyIdxFixerStartQuiesce:8881: <UUID>: middleMapFixer quiesce started.
ZDOMMoveStopDomLLPComplianceWorker:2104: <UUID>: DOM LLP compliance worker quiesce started.
VtxWriteBackHandler:162: <UUID>: Unexpected errors happen when write back worker waits for signal or time out: World is marked for death
ZDOM_FinalizeReturnStatus:3380: <UUID>: World is marked for death
ZDOM_FinalizeReturnStatus:3384: <UUID>: Final status is not OK: World is marked for death
ZDOMObjHandleFSPUpdate:5407: <UUID>: reg-obj: update=0 useWorker=1 cleanupObj=0: World is marked for death
VtxWriteBackHandler:371: <UUID>: Writeback worker [0x431492003ba8][112337568](cold)(vat) in [0x431492004018] terminated
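
The log signature above can be searched for directly on a host. The following is a minimal sketch, not an official diagnostic: the function name count_bootstrap_oom is hypothetical, and the log path /var/run/log/vmkernel.log is an assumption about where the live vmkernel log resides on the host. It counts only the two ZDOMObj_BootstrapPrepare "Out of memory" lines shown in the excerpt.

```shell
#!/bin/sh
# Count vmkernel log lines matching the bootstrap out-of-memory signature.
# count_bootstrap_oom is a hypothetical helper name used for illustration.
# Usage: count_bootstrap_oom <path-to-vmkernel-log>
count_bootstrap_oom() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    grep -c -E 'ZDOMObj_BootstrapPrepare:[0-9]+: .*(Failed to create transaction manager|Bootstrap prepare failed): Out of memory' "$1" || true
}

# Assumed location of the live vmkernel log; adjust if your host differs.
LOG=/var/run/log/vmkernel.log
if [ -r "$LOG" ]; then
    echo "bootstrap out-of-memory lines: $(count_bootstrap_oom "$LOG")"
fi
```

A count of 0 does not rule out the condition, since logs rotate; a non-zero count together with the other symptoms described in this article is a reason to proceed to the Resolution section.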

Additionally:

  • Running the "esxcli vsan cluster get" command from the ESXi CLI shows an abnormally large and/or ever-increasing "Sub-Cluster Membership Entry Revision" number.
  • The vSAN Skyline Health check shows a number of warnings related to cluster system health.
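
To see whether the membership revision is still climbing, it can be sampled twice and compared. This is a sketch under stated assumptions: the helper name membership_revision and the 60-second interval are illustrative choices, not values from this article, and the parsing assumes the field is printed as "Sub-Cluster Membership Entry Revision: <number>". The article does not define a threshold for "abnormally large", so the script only reports the change.

```shell
#!/bin/sh
# Extract the revision number from "esxcli vsan cluster get" output on stdin.
# membership_revision is a hypothetical helper name used for illustration.
membership_revision() {
    sed -n 's/^[[:space:]]*Sub-Cluster Membership Entry Revision:[[:space:]]*\([0-9][0-9]*\).*$/\1/p'
}

# Only sample when esxcli is available (that is, when run on an ESXi host).
if command -v esxcli >/dev/null 2>&1; then
    first=$(esxcli vsan cluster get | membership_revision)
    sleep 60   # arbitrary sampling interval, chosen for illustration
    second=$(esxcli vsan cluster get | membership_revision)
    if [ -n "$first" ] && [ -n "$second" ]; then
        echo "revision: $first -> $second (delta $((second - first)))"
    else
        echo "could not read the membership revision" >&2
    fi
fi
```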

Environment

8.0+ vSAN ESA

Cause

A vSAN memory race condition causes a bootstrap failure.

Resolution

If you have hit this condition or are planning benchmark testing on vSAN ESA, follow the instructions below to avoid this cluster instability condition.

Step 1:

Upgrade the ESXi hosts to 8.0.3 Patch 04. The setting in Step 2 should only be applied once the cluster is upgraded to 8.0.3 Patch 04.

Step 2:

Configure the following setting on every existing ESXi host/node in the cluster:

Disable PerOpTrace with the following setting on the command line:

            esxcfg-advcfg -s 0 /VSAN/DomUsePerOpTraceBuffer

Verify the setting on the command line:

            esxcfg-advcfg -g /VSAN/DomUsePerOpTraceBuffer

            The expected value is 0.

            If the configuration is incorrect, the output value is 1.

Note:
     - This setting does not need a reboot to take effect.
     - Once configured, the setting is persistent.
     - This setting can be set on OSA and ESA clusters.
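
The set-and-verify step can be combined into one small script that is run on each host. This is a minimal sketch, not an official tool: the helper name advcfg_value is hypothetical, and the parsing assumes that "esxcfg-advcfg -g" prints a line of the form "Value of DomUsePerOpTraceBuffer is 0". If the output on your build differs, adjust the pattern.

```shell
#!/bin/sh
# Parse the numeric value from "esxcfg-advcfg -g" output on stdin.
# advcfg_value is a hypothetical helper name; the "Value of <name> is <n>"
# output format is an assumption and should be confirmed on your host.
advcfg_value() {
    sed -n 's/^Value of .* is \([0-9][0-9]*\)[[:space:]]*$/\1/p'
}

OPT=/VSAN/DomUsePerOpTraceBuffer

# Only act when esxcfg-advcfg is available (that is, on an ESXi host).
if command -v esxcfg-advcfg >/dev/null 2>&1; then
    current=$(esxcfg-advcfg -g "$OPT" | advcfg_value)
    if [ "$current" != "0" ]; then
        # Disable PerOpTrace, then read the value back to confirm it.
        esxcfg-advcfg -s 0 "$OPT"
        current=$(esxcfg-advcfg -g "$OPT" | advcfg_value)
    fi
    if [ "$current" = "0" ]; then
        echo "OK: $OPT is 0"
    else
        echo "WARNING: $OPT is '$current', expected 0" >&2
    fi
fi
```

Reading the value before writing keeps the script safe to rerun: hosts that are already configured are left untouched and simply report OK.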

For consistency, customers may revert this advanced setting to 1 after upgrading to a fixed version.

Command:

            esxcfg-advcfg -s 1 /VSAN/DomUsePerOpTraceBuffer

Additional Information

If you have questions or concerns about this condition, please open a support case with Broadcom/VMware.