SSP Workload Cluster Goes Down: Control Plane Nodes Crash with OOM Due to Accumulation of CnsVolumeOperationRequest Resources

Article ID: 425275

Products

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Security Services Platform (SSP) control plane nodes may crash with Out of Memory (OOM) errors due to an accumulation of CnsVolumeOperationRequest (CVOR) custom resources. In environments where the vSphere CSI driver is heavily used with a high volume of persistent volume operations (create, delete, update), these CVOR resources pile up and bloat the cluster's etcd database. This overwhelms the Kubernetes API server and causes the control plane nodes to run out of memory, leading to node crashes and service disruption.

Symptoms

SSP control plane nodes crash or become unresponsive due to OOM
Nodes show as NotReady in kubectl output
CPU and memory utilization on the control plane nodes is high
Control plane nodes go missing
The VIP flaps, and the IP addresses of the control plane nodes disappear in vCenter
CSI provisioner logs reveal failure to list the CVORs due to API server timeout:
[YYYY-MM-DDTHH:MM:SS]","caller":"cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:328","msg":"failed to list CnsVolumeOperationRequests with error the server was unable to return a response in the time allotted, but may still be processing the request (get cnsvolumeoperationrequests.cns.vmware.com).
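
As a rough illustration of checking for two of the symptoms above, the following Python sketch parses the JSON form of the node list and counts occurrences of the CVOR list failure in a saved CSI provisioner log. It is only a sketch under assumptions: the helper names (not_ready_nodes, count_cvor_list_failures, fetch_nodes) are hypothetical, kubectl is assumed to be on the PATH and pointed at the SSP workload cluster, and how the provisioner log is collected is left to the reader.

```python
import json
import subprocess

# Substring taken from the CSI provisioner log message quoted in this article.
CVOR_TIMEOUT_MARKER = "failed to list CnsVolumeOperationRequests"


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True".

    `node_list` is the parsed output of `kubectl get nodes -o json`.
    A node with no Ready condition at all is also reported.
    """
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def count_cvor_list_failures(log_text: str) -> int:
    """Count log lines that contain the CVOR list failure message."""
    return sum(1 for line in log_text.splitlines() if CVOR_TIMEOUT_MARKER in line)


def fetch_nodes() -> dict:
    """Hypothetical helper: read the node list through kubectl (assumed on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, not_ready_nodes(fetch_nodes()) lists the nodes that kubectl reports as NotReady, and count_cvor_list_failures can be given the text of a log file saved from the CSI provisioner.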

Environment

SSP 5.1.0

Cause

This issue occurs because the CSI controller does not purge the CnsVolumeOperationRequest resources of successfully completed volume operations.
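
To gauge how many CVOR resources have accumulated, a read-only count can help. The Python sketch below summarizes a CVOR list by total, by namespace, and by oldest creation timestamp. It rests on assumptions: the helper names (summarize_cvors, fetch_cvors) are hypothetical, kubectl is assumed to be on the PATH, and the resource name is the one that appears in the CSI provisioner log message. On an affected cluster the list call itself may time out or add load to the API server, so treat this as a diagnostic aid only and not as the workaround.

```python
import json
import subprocess
from collections import Counter

# Resource name as it appears in the CSI provisioner log message.
CVOR_RESOURCE = "cnsvolumeoperationrequests.cns.vmware.com"


def summarize_cvors(cvor_list: dict) -> dict:
    """Summarize a parsed `kubectl get ... -o json` list of CVOR resources.

    Only standard Kubernetes object metadata is read (namespace and
    creationTimestamp); no CVOR-specific fields are assumed.
    """
    items = cvor_list.get("items", [])
    per_namespace = Counter(
        item.get("metadata", {}).get("namespace", "") for item in items
    )
    # creationTimestamp values are UTC RFC 3339 strings, so the
    # lexicographic minimum is also the oldest timestamp.
    stamps = [
        item["metadata"]["creationTimestamp"]
        for item in items
        if "creationTimestamp" in item.get("metadata", {})
    ]
    return {
        "total": len(items),
        "per_namespace": dict(per_namespace),
        "oldest": min(stamps) if stamps else None,
    }


def fetch_cvors(request_timeout: str = "60s") -> dict:
    """Hypothetical helper: list CVORs in all namespaces through kubectl."""
    result = subprocess.run(
        [
            "kubectl", "get", CVOR_RESOURCE,
            "--all-namespaces", "-o", "json",
            f"--request-timeout={request_timeout}",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling summarize_cvors(fetch_cvors()) returns a small dictionary that can be attached to a support request. If the call fails with the same timeout seen in the provisioner log, that failure is itself consistent with the symptoms described in this article.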

Resolution

Please contact Broadcom support for the workaround.

NOTE: This issue is fixed in SSP 5.1.1. However, Broadcom recommends applying the workaround first, before upgrading to SSP 5.1.1.
