Create Cluster or Create Pod remains stuck due to "A general system error occurred: Too many outstanding operations"

Article ID: 417103

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

When creating a new vSphere Kubernetes Service (VKS) cluster or a new Pod in an existing cluster, the following issues may occur:

  • Cluster creation remains stuck with its Ready condition set to “False” and never transitions to Ready.
  • Pod creation remains stuck and the Pod never reaches the Running state.

vCenter tasks fail with the following CNS error:

CNSFault - ServerFaultCode: A general system error occurred: Too many outstanding operations

Once volume attach/detach operations begin to fail, the vSphere CSI driver issues a rapid burst of attach/detach requests to vCenter without any backoff or delay.
This continuous flood of operations fills the vCenter task queue, resulting in the Too many outstanding operations error for all subsequent CSI calls.

Even if the issue originates in a single VKS cluster, it can impact all workloads connected to the same vCenter Server.
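
To confirm that the CSI controller is the source of the task flood, you can check its logs for the CNS fault. A minimal check, assuming the default vmware-system-csi namespace and deployment name used on the Supervisor (container names can vary between CSI releases):

kubectl get pods -n vmware-system-csi
kubectl logs -n vmware-system-csi deployment/vsphere-csi-controller -c csi-attacher | grep -i "too many outstanding operations"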

Environment

vSphere Kubernetes Service 3.x

vSphere Supervisor on vSphere 8.x or 9.x

Cause

The issue can occur after one or more VKS nodes become inaccessible.

This issue occurs when the vSphere CSI Controller repeatedly attempts failed attach/detach operations without introducing a delay or backoff mechanism.

As a result, the vCenter task queue becomes saturated, and all subsequent operations fail with a generic “Too many outstanding operations” error.
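
The retries are typically driven by volume attach/detach operations that can no longer complete because their node is gone. As an illustrative check (run with the affected cluster's kubeconfig), VolumeAttachment objects that never reach ATTACHED=true are the ones the controller keeps retrying:

kubectl get volumeattachments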

Resolution

To recover from this condition and restore normal operation, perform the following steps:

  1. Obtain the Supervisor Control Plane Credentials

    SSH into vCenter and run /usr/lib/vmware-wcp/decryptK8Pwd.py
    root@vcenter [ ~ ]# /usr/lib/vmware-wcp/decryptK8Pwd.py
    Read key from file
    
    Connected to PSQL
    
    Cluster: domain-c#: <supervisor cluster domain id>
    IP: <Supervisor FIP>
    PWD: <password>
    ------------------------------------------------------------

     

  2. SSH into the Supervisor Control Plane using the credentials obtained above.
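    For example, using the IP and password returned by decryptK8Pwd.py:
    ssh root@<Supervisor FIP>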

  3. Scale Down the CSI Controller to temporarily stop all CSI controller operations and prevent further task flooding.
    kubectl scale deployment vsphere-csi-controller -n vmware-system-csi --replicas=0
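    To confirm the controller has stopped, verify that no vsphere-csi-controller pods remain in the namespace:
    kubectl get pods -n vmware-system-csi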

     

  4. Identify the Inaccessible Node
    1. Log in to vCenter Server.
    2. Identify if any TKC Worker VM is marked as inaccessible.
    3. Note the name of the inaccessible node.
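    As an additional cross-check from the Supervisor (illustrative; phase names vary by CAPI version), the Machine object backing the lost VM is typically no longer in the Running phase:
      kubectl get machines -n <ns>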

  5. Delete the Inaccessible Node. 
    • Make sure the TKC cluster whose node has become inaccessible is not paused.
    • If CAPI v1.7.x is present (VKS / TKG Service v3.1.0 and later, or vCenter 8.0 Update 3b and later), annotate the Machine with cluster.x-k8s.io/remediate-machine:
      kubectl annotate machine -n <ns> <machine-name> 'cluster.x-k8s.io/remediate-machine=""'
    • If CAPI is earlier than v1.7.0, delete the Machine using kubectl:
      kubectl delete machine -n <ns> <machine-name>
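    • After remediation or deletion, confirm that the old Machine object is gone (or has been replaced by a new one):
      kubectl get machines -n <ns>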
  6. Remove the inaccessible node VM from the vCenter inventory.
  7. Clean Up CNSNodeVMAttachment CRs
    • Identify any CNSNodeVMAttachment Custom Resource instances associated with the inaccessible node.
    • If any of these CRs are stuck in deletion or attachment failed states, remove their finalizers to delete them manually:
      kubectl patch cnsnodevmattachment <attachment-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
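    • After patching, verify that no CNSNodeVMAttachment CRs referencing the inaccessible node remain (a quick filter by node name):
      kubectl get cnsnodevmattachments -A | grep <inaccessible-node-name>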

       

  8. Restart the vCenter VPXD Service
    Restarting the vpxd service on the vCenter Server Appliance clears the internal task queue and restores normal API responsiveness. From the vCenter shell, run:
    service-control --stop vmware-vpxd
    service-control --start vmware-vpxd
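    Before proceeding, you can confirm the service is running again:
    service-control --status vmware-vpxd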

     

  9. Scale Up the CSI Controller to bring it back online:
    kubectl scale deployment vsphere-csi-controller -n vmware-system-csi --replicas=2
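    Finally, verify that the controller pods return to Running and that new attach/detach operations complete without the CNS fault:
    kubectl get pods -n vmware-system-csi -w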