SSP Workload Cluster Goes Down: Control Plane Nodes Crash with OOM Due to Accumulation of CnsVolumeOperationRequest Resources

Article ID: 425275

Updated On:

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Security Services Platform (SSP) control plane nodes may crash with Out of Memory (OOM) errors due to an accumulation of CnsVolumeOperationRequest (CVOR) custom resources. In environments where the vSphere CSI driver is heavily used with a high volume of persistent volume operations (create, delete, update), these CVOR resources pile up and bloat the cluster's etcd database. This overwhelms the Kubernetes API server and causes the control plane nodes to run out of memory, leading to node crashes and service disruption.

Symptoms

  • SSP control plane nodes crash or become unresponsive due to OOM
  • Nodes show as NotReady in kubectl output
  • CPU and memory utilization on the control plane nodes is high
  • Control plane nodes are missing from the cluster
  • The VIP flaps, and the IP addresses of the control plane (CP) nodes disappear in vCenter (VC)
  • CSI provisioner logs reveal failure to list the CVORs due to API server timeout:
    [YYYY-MM-DDTHH:MM:SS]","caller":"cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:328","msg":"failed to list CnsVolumeOperationRequests with error the server was unable to return a response in the time allotted, but may still be processing the request (get cnsvolumeoperationrequests.cns.vmware.com).
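
The first and last symptoms above can be checked with a short script. The following is a minimal sketch, not a supported diagnostic tool. It assumes you have already saved the output of `kubectl get nodes -o json` and the CSI provisioner log text by your own means. The function names `not_ready_nodes` and `cvor_list_failures` are hypothetical helpers introduced only for illustration; the search string is taken from the log line quoted above.

```python
import json
from typing import List

# Substring copied from the CSI provisioner log line quoted in this article.
CVOR_LIST_FAILURE = "failed to list CnsVolumeOperationRequests"


def not_ready_nodes(nodes_json: str) -> List[str]:
    """Return names of nodes whose Ready condition is not "True".

    Expects the text produced by `kubectl get nodes -o json`.
    A node with no Ready condition at all is also reported.
    """
    names = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def cvor_list_failures(log_text: str) -> List[str]:
    """Return the log lines that contain the CVOR list-failure message."""
    return [line for line in log_text.splitlines() if CVOR_LIST_FAILURE in line]
```

A non-empty result from either function matches the corresponding symptom; it does not by itself confirm the root cause.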

Environment

SSP 5.1

Cause

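This issue occurs because the vSphere CSI controller does not purge the CnsVolumeOperationRequest (CVOR) resources for volume operations that completed successfully, so they accumulate over time.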
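
To gauge how many CVOR resources have accumulated, a read-only count can help. The sketch below is illustrative only and is not the workaround referenced in the Resolution section. The resource name is taken from the log line quoted in the Symptoms section. The status layout (`status.latestOperationDetails[].taskStatus`) is an assumption about the CVOR schema and should be verified against your cluster; `fetch_cvors_json`, `summarize_cvors`, and `main` are hypothetical names. On an affected cluster the list call itself may time out, as described in the Symptoms section.

```python
import json
import subprocess
from collections import Counter


def fetch_cvors_json() -> str:
    """Run kubectl and return the raw JSON list of CVOR resources.

    The resource name comes from the log line quoted in this article.
    """
    result = subprocess.run(
        ["kubectl", "get", "cnsvolumeoperationrequests.cns.vmware.com",
         "-A", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def summarize_cvors(cvors_json: str) -> Counter:
    """Count CVORs by the status of their most recent operation.

    ASSUMPTION: each item has status.latestOperationDetails, a list whose
    entries carry a taskStatus string. Items without it count as "unknown".
    """
    counts = Counter()
    for item in json.loads(cvors_json).get("items", []):
        details = item.get("status", {}).get("latestOperationDetails") or []
        status = details[-1].get("taskStatus", "unknown") if details else "unknown"
        counts[status] += 1
    return counts


def main() -> None:
    """Print the total CVOR count and a per-status breakdown."""
    summary = summarize_cvors(fetch_cvors_json())
    print("total:", sum(summary.values()))
    for status, count in summary.most_common():
        print(status, count)
```

Calling `main()` from a host with `kubectl` access to the workload cluster prints the totals. The script only reads data and does not modify or delete any resource.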

Resolution

Please contact Broadcom support for the workaround.

Note: This issue will be fixed in the next release of SSP.