vSAN cluster partition with error "Arena memory exhausted"

Products

VMware vSAN

Issue/Introduction

This article explains how to:

Reconnect a host in a network-partitioned vSAN cluster and clean up the DISCARDED_COMPONENT entries.
Restart the epd and CMMDS Arena services to resume normal vSAN operation.

Symptoms:

The cluster is partitioned and/or node membership changes rapidly.
The Sub-Cluster Membership Entry Revision increments rapidly because nodes constantly join and leave the cluster.
Commands that query CMMDS (e.g. cmmds-tool) intermittently fail and/or return constantly changing information.
An excessively high count of DISCARDED_COMPONENTS entries (e.g. 10,000 to 1 million).

On a host that repeatedly enters a partitioned state and rejoins the cluster, an administrator can confirm from the logs that CMMDS is having issues:

# grep 'arena space' /var/log/vmkernel.log

Confirm via CLI that the cluster membership is incrementing by running this command a few times a few seconds apart:

# localcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2019-07-28T03:26:18Z
   Local Node UUID: < ABC >
....... output is snipped  .......
  Sub-Cluster Membership Entry Revision: 5 >> Starts incrementing on impacted hosts
  Sub-Cluster Member Count:
  Config Generation: < XYZ > 18 2019-07-08T18:05:37.650 >> Host State change time
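
The repeated check described above can be scripted. The following is a minimal sketch, assuming the revision appears on a line of the form `Sub-Cluster Membership Entry Revision: <number>` as in the sample output. The function names `get_revision` and `sample_revision` are hypothetical helpers, not part of any VMware tooling.

```shell
#!/bin/sh
# Hypothetical helper: read `localcli vsan cluster get` output on stdin
# and print the numeric Sub-Cluster Membership Entry Revision.
get_revision() {
    sed -n 's/^[[:space:]]*Sub-Cluster Membership Entry Revision:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: take several readings a few seconds apart.
# $1 = number of samples (default 5), $2 = seconds between samples (default 5).
sample_revision() {
    count="${1:-5}"
    interval="${2:-5}"
    i=0
    while [ "$i" -lt "$count" ]; do
        rev="$(localcli vsan cluster get | get_revision)"
        echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) revision=${rev:-unknown}"
        i=$((i + 1))
        # Sleep only between samples, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

Calling `sample_revision 5 5` on a host would print five timestamped readings. A value that keeps climbing between readings matches the symptom described in this article.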

The “epd” service is not running because of a lock file that it cannot remove. This can be verified in /var/log/epd.log by looking for the following traces:

2019-04-13T21:23:44.110Z 70451 -- XXX: Lock file '/scratch/epd-store.db' exists.
2019-04-13T21:23:44.110Z 70451 -- XXX: Did the host or EPD recently crash?
2019-04-13T21:23:44.110Z 70451 -- XXX: Assuming it's OK. Unlinking lock file..
2019-04-13T21:23:44.111Z 70451 Failed to delete lock file: Is a directory
2019-04-13T21:23:44.111Z 70451 SRV: failed to open db: Failure
2019-04-13T21:23:44.111Z 70451 SRV: init for store-mgmt failed: Failure
2019-04-13T21:23:44.111Z 70451 SRV: initialization failed: Failure
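
A quick way to check for these traces is to count the matching lines. This is a rough sketch that assumes the messages appear exactly as shown above. `count_epd_lock_failures` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count lines in an EPD log that show the lock file
# could not be removed or the database could not be opened.
# $1 = path to the log file (this article uses /var/log/epd.log).
count_epd_lock_failures() {
    # grep -c prints the count; `|| true` keeps a zero count from
    # being treated as a command failure.
    grep -c -E "Failed to delete lock file|SRV: failed to open db" "$1" || true
}
```

For example, `count_epd_lock_failures /var/log/epd.log` prints the number of matching lines. A non-zero result is consistent with the EPD startup failure shown in the traces.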

Error messages such as the following appear in the ESXi /var/run/log/vmkernel.log:

2018-05-13T13:36:03.028Z cpu16:67526)CMMDS: RejoinBuildSnapshotEntry:2043: Failed to allocate arena space for snapshot

2018-05-13T13:36:03.028Z cpu16:67526)CMMDS: CMMDSLogStateTransition:1309: Transitioning(xxxxxxxx-1f8f-1364-8400-xxxxxxxxxxxx) from Rejoin to Discovery: (Reason: Arena memory exhausted)

2018-05-13T13:59:20.463Z cpu46:67519)CMMDS: RejoinBuildSnapshotEntry:2043: Failed to allocate arena space for snapshot

2018-05-13T13:59:20.463Z cpu46:67519)CMMDS: CMMDSLogStateTransition:1309: Transitioning(xxxxxxxx-f728-bd70-67aa-xxxxxxxxxxxx) from Rejoin to Discovery: (Reason: Arena memory exhausted)
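
To gauge how often these transitions occur and which nodes they involve, the log can be summarized. The sketch below assumes the message format shown above. The helper names `count_arena_exhausted` and `list_affected_nodes` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count state transitions caused by arena exhaustion.
# $1 = path to the vmkernel log.
count_arena_exhausted() {
    grep -c "Reason: Arena memory exhausted" "$1" || true
}

# Hypothetical helper: list the distinct node UUIDs named in those
# transitions, one per line.
list_affected_nodes() {
    sed -n 's/.*Transitioning(\([^)]*\)).*Reason: Arena memory exhausted.*/\1/p' "$1" | sort -u
}
```

Running both functions against /var/run/log/vmkernel.log gives the number of "Arena memory exhausted" transitions and the node UUIDs they reference.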

Environment

VMware vSAN 6.x
VMware vSAN 7.x

Cause

The EPD service is not running and therefore does not clean up discarded components, which in turn exhausts CMMDS Arena resources and makes CMMDS unstable.
The initial EPD service failure relates to how the service starts at boot and where it writes during startup.
Drivers with known issues, such as Emulex elxiscsi 11.2.1152.0-1OEM.650.0.0.4240417, can break the configured scratch location and as a result affect how this service starts.

The following misconfigurations have been found to cause the issue, because the vSAN service requires scratch in order to run:

- The scratch partition/location is not available for logging.
- H730P controllers with an incorrect disk-access mode, where the scratch partition is unavailable or only partially available for logging.
- Unsupported mixed disk-access modes (e.g. RAID1 boot/log devices + passthrough vSAN devices), where the scratch partition is unavailable or only partially available for logging.

Resolution

ESXi 7.0 Update 3f (Build 20036589) and later versions include enhancements to the handling of discarded components.

If the issue is observed on the latest builds and matches the symptoms above, contact Broadcom Support to investigate the issue and provide a workaround.