vSAN Witness Component count at 100% due to TCP Keepalive Failures in RDT Stack and DOM Component Leak

Article ID: 398955

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms

  1. Skyline Health warns that the component limit has been exceeded on the witness node.
  2. Component utilization shows the witness node at its maximum component count, higher than on the data nodes.

In /var/run/log/clomd.log, clomd is seen running concurrent repair tasks to address missing witness components across multiple objects:

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373051] CLOMReconfigure: exit: obj 575c4565-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373052] CLOMReconfigure: exit: obj 84deaf65-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373054] CLOMReconfigure: exit: obj 4d397a64-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373055] CLOMReconfigure: exit: obj 26437f64-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373056] CLOMReconfigure: exit: obj 352c7f64-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success

Tracing one of these witness components (identified by UUID 26437f64-####-####-####-####) through /var/run/log/clomd.log shows the cycle. RDT session timeouts on the witness node cause the associated witness components to be marked absent. CLOM detects the absence and repeatedly initiates repair operations, resulting in a recurring repair loop:

YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876] CLOM_PostWorkItem: Posted a work item opID:1804373061 for 26437f64-####-####-####-#### group: 00000000-0000-0000-0000-000000000000 Type: REPAIR delay 0 (Success)
YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373061] CLOMProcessWorkItem: Op REPAIR starts:1804373061
YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373061] CLOMReconfigure: Reconfiguring 26437f64-####-####-####-#### workItem type REPAIR
YYYY-MM-DDTHH:MM:SSZ Er(27) clomd[5535814]: [Originator@6876 opID=1804373061] CLOMReplacementPreWorkRepair: Repair needed. 1 absent/degraded data components for 26437f64-####-####-####-#### found
YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373061] CLOMReconfigure: exit: obj 26437f64-####-####-####-#### transiantCapGenerated - total: 0, site1: 0, site2: 0, workItem type REPAIR configDelay 0 newConfigGenerated 1 newCompWitnessOnly 1 status Success
YYYY-MM-DDTHH:MM:SSZ No(29) clomd[5535814]: [Originator@6876 opID=1804373061] CLOM_PublishResyncBytes: No more work for 26437f64-####-####-####-#### (Success), reset queued resync bytes to 0
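To gauge how widespread the repair loop is, the `CLOMReconfigure: exit` lines can be tallied per object. The following is a minimal sketch, not a supported tool: it assumes the log lines have exactly the layout shown in the excerpts above, and the function names and the `threshold` value are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Pattern for the "CLOMReconfigure: exit" lines shown in the excerpts above.
# Assumption: the field order matches those excerpts exactly.
EXIT_RE = re.compile(
    r"CLOMReconfigure: exit: obj (?P<uuid>\S+) .*?"
    r"workItem type (?P<type>\w+) .*?"
    r"newCompWitnessOnly (?P<witness_only>\d+) "
    r"status (?P<status>\w+)"
)


def count_witness_only_repairs(lines):
    """Count REPAIR exits with newCompWitnessOnly 1, keyed by object UUID."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match and match["type"] == "REPAIR" and match["witness_only"] == "1":
            counts[match["uuid"]] += 1
    return counts


def suspected_repair_loops(lines, threshold=3):
    """Return objects repaired at least `threshold` times.

    `threshold` is an arbitrary illustrative cutoff, not a documented value.
    """
    counts = count_witness_only_repairs(lines)
    return {uuid: n for uuid, n in counts.items() if n >= threshold}
```

One way to use it is to copy clomd.log off the host and pass the open file to `suspected_repair_loops()`. Objects that show many witness-only repairs within a short period are consistent with the loop described above, but a high count alone does not confirm this issue.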

Environment

VMware vSAN 8.x

Cause

During reconfiguration, the witness node does not receive the required entries. Without these entries, the witness cannot properly validate or clean up components, so stale components are leaked and accumulate on the witness.

Resolution

To remediate this issue, upgrade the ESXi hosts to a version that includes the required bug fix:

  • vSphere 8.0 Patch 05 (8.0 P05)

  • vSphere 9.0 GA or later releases

If your environment is not currently running a version that contains this fix, contact Broadcom Support for further assistance.