CNSVolumeOperationRequest resources pile up within the cluster, causing the API Server to crash

Article ID: 400550


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • In environments where the vSphere CSI driver is heavily used, especially with a high volume of persistent volume operations (create, delete, update), a large buildup of CnsVolumeOperationRequest (CVOR) custom resources is observed. This accumulation not only bloats the cluster’s etcd database but also overwhelms the Kubernetes API server, eventually leading to service disruption or a crash of the API server itself.

 

  • CSI provisioner logs reveal failure to list the CVORs due to API server timeout. For example:

    [YYYY-MM-DDTHH:MM:SS] "caller":"cnsvolumeoperationrequest/cnsvolumeoperationrequest.go:328","msg":"failed to list CnsVolumeOperationRequests with error the server was unable to return a response in the time allotted, but may still be processing the request (get cnsvolumeoperationrequests.cns.vmware.com)"

 

  • The API server timeout in the log above points to an ever-growing number of successful but uncleared CVORs, which overburdens the etcd database and the API server.

 

  • The following command lists all the CVORs that have not been cleaned up by the CSI driver; a quick count sketch follows after this list.

    kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi
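
  • To gauge the scale of the buildup, the same listing can be piped into a simple count. The following is a minimal sketch using only standard kubectl output and wc:

    kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi --no-headers | wc -l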

Environment

vSphere CSI Driver

Cause

The root cause of this issue lies in the CSI controller’s behavior during restarts. When the CSI controller restarts, it inadvertently resets the internal cleanup timer for stale CnsVolumeOperationRequest instances. The default cleanup interval is 1440 minutes (24 hours), so the system defers resource purging for an entire day, even after a volume operation has completed successfully. If the controller restarts more frequently than the cleanup interval elapses, the purge never gets a chance to run and completed requests accumulate indefinitely.
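
To see how long completed requests have been lingering, the existing CVORs can be sorted by creation time. This is a minimal sketch using the standard kubectl --sort-by flag; entries that are many hours or days old indicate the deferred cleanup has not run:

kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi --sort-by=.metadata.creationTimestamp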

Resolution

Broadcom engineering is aware of the issue and is working on a permanent fix to be included in a future CSI driver release. In the meantime, the following step-by-step cleanup instructions can be used as a workaround to safely delete successfully completed CVOR resources and relieve the load.

 

Step 1: Scale Down the vSphere CSI Driver.

To ensure no new CnsVolumeOperationRequests are generated during the cleanup, scale down the CSI driver deployment:


kubectl scale deployment vsphere-csi-controller -n vmware-system-csi --replicas=0
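
Before scaling down, it can also help to record the current replica count so that Step 4 restores the original value, and to confirm afterwards that the controller pods have terminated. A minimal sketch using standard kubectl commands:

# Record the current replica count for use in Step 4
kubectl get deployment vsphere-csi-controller -n vmware-system-csi -o jsonpath='{.spec.replicas}{"\n"}'

# Confirm the controller pods have terminated
kubectl get pods -n vmware-system-csi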

 

Step 2: Identify Completed CNSVolumeOperationRequests.

kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi -o jsonpath='{range .items[?(@.status.latestOperationDetails[0].taskStatus=="Success")]}{.metadata.name}{"\n"}{end}'

Example Output:
pvc-xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
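
As a sanity check before deleting anything, the same filter can be piped into a count of completed requests. A minimal sketch:

kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi -o jsonpath='{range .items[?(@.status.latestOperationDetails[0].taskStatus=="Success")]}{.metadata.name}{"\n"}{end}' | wc -l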

 

Step 3: Delete Completed Requests.

You can delete these manually using:


kubectl delete cnsvolumeoperationrequests.cns.vmware.com <request-name> -n vmware-system-csi

OR

Use the following command to automate the deletion of all requests whose taskStatus is "Success":


kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi -o jsonpath='{range .items[?(@.status.latestOperationDetails[0].taskStatus=="Success")]}{.metadata.name}{"\n"}{end}' | xargs kubectl delete cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi

 

Example Output:
cnsvolumeoperationrequest.cns.vmware.com "pvc-xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx" deleted
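
If the backlog is very large, deleting in smaller batches keeps each kubectl invocation manageable. The following is a minimal sketch that assumes GNU xargs; -n 50 limits each delete call to 50 names and -r skips the delete entirely when no completed requests are found:

kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi -o jsonpath='{range .items[?(@.status.latestOperationDetails[0].taskStatus=="Success")]}{.metadata.name}{"\n"}{end}' | xargs -r -n 50 kubectl delete cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi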

 

Step 4: Scale the CSI Driver Back Up.

After the cleanup is completed, restore the CSI driver deployment to its original replica count (recorded in Step 1; 3 in this example):

kubectl scale deployment vsphere-csi-controller -n vmware-system-csi --replicas=3
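
To confirm the controller is healthy again after scaling up, the rollout status and pods can be checked; a minimal sketch:

kubectl rollout status deployment/vsphere-csi-controller -n vmware-system-csi
kubectl get pods -n vmware-system-csi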

 

Note:

This procedure is safe for removing only those requests whose taskStatus is set to Success. Requests with an InProgress or Error status should be handled with caution, as they may still be processing or may require additional investigation.
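
To review the requests that should not be deleted, the same jsonpath filter can be inverted to print each remaining request alongside its status. A minimal sketch (kubectl's jsonpath filter syntax supports !=):

kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi -o jsonpath='{range .items[?(@.status.latestOperationDetails[0].taskStatus!="Success")]}{.metadata.name}{"\t"}{.status.latestOperationDetails[0].taskStatus}{"\n"}{end}'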

Additional Information

The buildup of CnsVolumeOperationRequest resources is a known scalability issue in the current CSI driver implementation. While the official fix is awaited, this manual cleanup approach can significantly reduce pressure on the Kubernetes API server and etcd, preventing crashes and performance degradation. Regular monitoring and scheduled cleanups (where feasible) are recommended in high-volume environments.
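
As one lightweight way to monitor the backlog between cleanups, a shell loop (or an equivalent cron entry) can log the CVOR count periodically. A minimal sketch, assuming a host with kubectl configured against the cluster:

# Log the CVOR count once per hour
while true; do
  echo "$(date -u +%FT%TZ) CVOR count: $(kubectl get cnsvolumeoperationrequests.cns.vmware.com -n vmware-system-csi --no-headers | wc -l)"
  sleep 3600
done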