Storage policy reconfiguration of large objects is stuck and resync shows no progress

Products

VMware vSAN

Issue/Introduction

This article explains how to stabilize the vSAN cluster and clean up DISCARDED_COMPONENTS entries.
Restarting the EPD and CMMDS Arena services resumes normal vSAN operations.

Symptoms:

The cluster is partitioned and/or node membership changes rapidly.
The Sub-Cluster Membership Entry Revision increments rapidly because nodes are constantly joining and leaving the cluster.
Commands that query CMMDS (e.g. cmmds-tool) intermittently fail and/or return constantly changing information.
An active resync does not progress.
A high number of DISCARDED_COMPONENTS entries is present on multiple hosts.

Note: DISCARDED_COMPONENTS are counted per node, and it is possible that only one node has a critical build-up.
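Because the counts are per node, a cluster-wide total can hide a single host with a critical build-up. The following is a minimal Python sketch of ranking hosts by their own count. It assumes the per-host counts have already been collected by some other means; the function name, the host names, and the numbers are hypothetical and are not output of the attached script.

```python
from typing import Dict, List, Tuple


def rank_nodes(counts: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (host, count, share_of_total) tuples, highest count first.

    `counts` maps a host name to its DISCARDED_COMPONENTS count.
    How those counts are gathered is outside this sketch.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Guard against division by zero when no host reports any entries.
    return [(host, n, (n / total if total else 0.0)) for host, n in ranked]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = {"esx-a": 120, "esx-b": 95000, "esx-c": 80}
    for host, n, share in rank_nodes(example):
        print(f"{host}: {n} entries ({share:.1%} of cluster total)")
```

Sorting by the per-host value rather than summing makes a single outlier host stand out immediately.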

In vsantraces logs you see the following messages:
2023-06-30T06:43:47.833657 [162009952] [cpu75] [] DOMTraceProcessSubscrEntry:1927: {'obj':0x45dc29f3cd80, 'objType': 'COMP', 'queryType': 35, 'numFiringSubscrs-24': 0, 'numRetrySubscrs-24': 51, 'subscrOp-32': 0x243fd8c0, 'subscrEntry-32': 0x2a3d68c0, 'queryUuid': 'db2dd163-cace-ac06-a423-b483510025bc', 'status': 'VMK_NO_MEMORY', 'isDisabled': False, 'isShared': False, 'isRetry': True, 'processTimeMs': 0, 'fetchesRun': 1, 'unmarshalsRun': 0, 'role': 'DOM_ROLE_COMPONENT_SERVER'}

In vmkernel.log:
2023-06-30T05:58:06.885Z cpu127:2099358)WARNING: exprmsh: Error unmarshaling structure CmmdsDiscardedComponentsEntry: Out of memory
2023-06-30T05:58:56.602Z cpu123:2099358)DOM: DOMComponentObjectDeletedEntryCb:12729: Failed to update DISCARDED_ENTRY entry for b22dd163-a815-ca68-cd35-b483510025bc: Out of memory

In cmmdsd.log:
2023-06-27T07:00:19.225Z 2099726 WARNING Traversing CMMDS entries returned error: Out of memory
2023-06-27T07:30:19.534Z 2099726 WARNING Traversing CMMDS entries returned error: Out of memory
2023-06-27T08:00:19.855Z 2099726 WARNING Traversing CMMDS entries returned error: Out of memory

In epd.log:
2023-06-26T17:20:24.763Z 4645130 PANIC: Unrecoverable memory allocation failure
2023-06-26T17:20:24.763Z 4645130 Backtrace:
2023-06-26T17:20:24.763Z 4645130 Backtrace[0] 0000030ecf9429a0 rip=000000fa299df98f rbx=0000030ecf9429a0 rbp=0000030ecf942dd0 r12=000000fa2a673788 r13=0000030ecf942de8 r14=000000f9e904e1e0 r15=000000f9e905c420

2023-06-27T16:00:46.740Z 4779858 Failed to dump core: Failure.
2023-06-27T16:00:46.740Z 4779858 Msg_Post: Error
2023-06-27T16:00:46.740Z 4779858 [msg.log.error.unrecoverable] VSAN CMMDS persistence daemon unrecoverable error: (epd)
2023-06-27T16:00:46.740Z 4779858 Unrecoverable memory allocation failure
2023-06-27T16:00:46.740Z 4779858 [msg.panic.requestSupport.withoutLog] You can request support.
2023-06-27T16:00:46.740Z 4779858 [msg.panic.requestSupport.vmSupport.vmx86]
2023-06-27T16:00:46.740Z 4779858 To collect data to submit to VMware technical support, run "vm-support".
2023-06-27T16:00:46.740Z 4779858 [msg.panic.response] We will respond on the basis of your support entitlement.
2023-06-27T16:00:46.740Z 4779858 ----------------------------------------
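The log messages above can be searched for mechanically. The following is a minimal Python sketch that counts occurrences of the out-of-memory signatures quoted in this article. The signature strings are copied from the excerpts above; the labels and the function name are made up for this sketch, and it is not a supported diagnostic tool.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the log excerpts in this article.
# The dictionary keys are labels invented for this sketch.
SIGNATURES = {
    "vsantraces_no_memory": "'status': 'VMK_NO_MEMORY'",
    "vmkernel_unmarshal_oom": (
        "Error unmarshaling structure CmmdsDiscardedComponentsEntry: Out of memory"
    ),
    "vmkernel_discarded_entry_oom": "Failed to update DISCARDED_ENTRY entry",
    "cmmdsd_traverse_oom": "Traversing CMMDS entries returned error: Out of memory",
    "epd_alloc_failure": "Unrecoverable memory allocation failure",
}


def count_signatures(lines: Iterable[str]) -> Counter:
    """Count how many lines contain each known signature."""
    hits: Counter = Counter()
    for line in lines:
        for label, needle in SIGNATURES.items():
            if needle in line:
                hits[label] += 1
    return hits


if __name__ == "__main__":
    # Two lines quoted from the excerpts above, plus one unrelated line.
    sample = [
        "2023-06-27T07:00:19.225Z 2099726 WARNING Traversing CMMDS entries "
        "returned error: Out of memory",
        "2023-06-26T17:20:24.763Z 4645130 PANIC: Unrecoverable memory "
        "allocation failure",
        "an unrelated log line",
    ]
    for label, n in sorted(count_signatures(sample).items()):
        print(f"{label}: {n}")
```

Any iterable of text lines works as input, including an open file handle for a copy of the relevant log. Plain substring matching is used instead of regular expressions so the signatures stay identical to the text shown in the logs.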

Environment

VMware vSAN 7.0.x

Cause

The EPD service is hung or not running and therefore does not clean up discarded components. This exhausts CMMDS Arena resources and makes CMMDS unstable.
The underlying trigger is a storage policy change on objects of 8TB or larger: the reconfiguration is not handled properly, which results in CLOM continuously reconfiguring these objects.

Resolution

Upgrade vCenter and ESXi to version 7.0U3k or higher

Workaround:

If an upgrade is not possible, use a storage policy with a stripe width of 3 for objects of 8TB or larger.
If the environment is already impacted, open a case with VMware vSAN support for remediation.
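To decide which objects the stripe-width workaround applies to, the sizes can be compared against the 8TB threshold. The following is a minimal Python sketch that assumes the object sizes have already been exported as bytes. The object names and the function name are hypothetical, and interpreting 8TB as 8 * 2**40 bytes is an assumption of this sketch, which is why the threshold is a parameter.

```python
from typing import Dict, List

TIB = 2 ** 40  # assumption: "TB" in this article is treated as a binary terabyte


def objects_at_or_above_threshold(
    sizes_bytes: Dict[str, int],
    threshold_bytes: int = 8 * TIB,
) -> List[str]:
    """Return the names of objects whose size meets or exceeds the threshold."""
    return sorted(
        name for name, size in sizes_bytes.items() if size >= threshold_bytes
    )


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only.
    inventory = {
        "vm-db01-disk1": 10 * TIB,
        "vm-web01-disk1": 200 * 2 ** 30,
        "vm-archive-disk2": 8 * TIB,
    }
    for name in objects_at_or_above_threshold(inventory):
        print(f"{name}: candidate for a stripe width 3 policy")
```

The comparison is inclusive because the workaround applies to objects of 8TB or larger.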

Attachments

count_discarded_components