VKS Supervisor Cluster stuck in a deleting state due to NSX cleanup failure

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After attempting to remove a Supervisor cluster from vCenter, the removal is stuck in a 'Deleting' status because the cluster's NSX objects are not being deleted.
The vSphere UI may show the message "Cleanup requests to NSX Manager Failed".
The Supervisor cluster VMs have been deleted successfully, but all of the cluster's NSX objects still remain in NSX.
The /var/log/vmware/wcp/wcpsvc.log file shows errors similar to the following:

2025-12-04T18:18:05.483Z info wcp [cleanup/vpc.go:120] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] Running Avi cleanup on host 'xx.xx.xx.xx' with user 'username' for cluster 'domain-cXXXX:aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'
2025-12-04T18:18:05.496Z debug wcp [cleanup/vpc.go:169] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] NSX cleanup initiated

2025-12-04T18:19:49.039Z error wcp [common/k8sdeploymentutil.go:38] [opID=1d3fb1f8-51ce-9ca3-adfc-c7fdc08a40b3] Unable to get deployment status of vmware-system-nsx/nsx-ncp. Err: Resource Type ClusterComputeResource, Identifier domain-cXXXX is not found.
2025-12-04T18:20:19.577Z error wcp [common/k8sdeploymentutil.go:38] [opID=6930f1ad] Unable to get deployment status of vmware-system-nsx/nsx-ncp. Err: Resource Type ClusterComputeResource, Identifier domain-cXXXX is not found.
2025-12-04T18:20:19.614Z error wcp [common/k8sdeploymentutil.go:38] [opID=6930f1b0] Unable to get deployment status of vmware-system-nsx/nsx-ncp. Err: Resource Type ClusterComputeResource, Identifier domain-cXXXX is not found.
2025-12-04T18:20:39.999Z error wcp [kubelifecycle/kube_instance.go:2688] Failed to get Kubernetes healthz results on server, xx.xx.xx.xx: Get "http://localhost:1080/external-cert/http1/xx.xx.xx.xx/6443/healthz?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2025-12-04T18:20:46.773Z error wcp [cleanup/vpc.go:171] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] NSX cleanup failed: failed to clean up
2025-12-04T18:20:46.776Z error wcp [cleanup/vpc.go:77] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] Error cleaning NSX resources: NSX cleanup failed: failed to clean up
2025-12-04T18:20:46.776Z error wcp [kubelifecycle/controller.go:2531] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] Teardown of external appliance resources failed. Err: error cleaning NSX resources for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': failed to perform NSX cleanup for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': NSX cleanup failed for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': NSX cleanup failed: failed to clean up
2025-12-04T18:20:46.776Z warning wcp [kubelifecycle/controller.go:478] [opID=6930ef9c-21a43e79-86a3-4fdd-ad12-b86f0186b3f9] Unable to disable cluster domain-cXXXX because of the reason [FailedWithSystemError]. Err error cleaning NSX resources for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': failed to perform NSX cleanup for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': NSX cleanup failed for Supervisor 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee': NSX cleanup failed: failed to clean up
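
To check whether an environment shows the same signature, the log can be searched for the cleanup failure messages seen above. The following is a minimal sketch, not an official tool: the function name find_nsx_cleanup_failures is hypothetical, and the search strings are taken from this excerpt, so they may differ between releases.

```shell
#!/bin/bash
# Hypothetical helper: print the most recent NSX cleanup failures from wcpsvc.log.
# The default path is the one named in this article; pass another path as $1.
find_nsx_cleanup_failures() {
    local log_file="${1:-/var/log/vmware/wcp/wcpsvc.log}"
    # Match the two failure messages shown in the excerpt above and
    # keep only the last 20 matching lines.
    grep -E 'NSX cleanup failed|Teardown of external appliance resources failed' "$log_file" \
        | tail -n 20
}
```

Running find_nsx_cleanup_failures with no argument reads the default log path. No output means neither message was found in that file.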

The /var/log/vmware/wcp/stdstream.log.stdout* log files show errors similar to the following:

2025-12-08 15:10:24.408 ERROR util/utils.go:263 Handle HTTP response {"status": 400, "request URL": "http://localhost:1080/external-cert/http1/NSX_MANAGER1/443/api/v1/reverse-proxy/node/health", "response body": "Destination server data is invalid", "error": "received HTTP Bad Request Error"}
2025-12-08 15:10:24.408 ERROR nsx/endpoint.go:174 Failed to validate API cluster {"endpoint": "NSX_MANAGER1"}
2025-12-08 15:10:24.408 ERROR util/utils.go:263 Handle HTTP response {"status": 400, "request URL": "http://localhost:1080/external-cert/http1/NSX_MANAGER2/443/api/v1/reverse-proxy/node/health", "response body": "Destination server data is invalid", "error": "received HTTP Bad Request Error"}
2025-12-08 15:10:24.408 ERROR nsx/endpoint.go:174 Failed to validate API cluster {"endpoint": "NSX_MANAGER2"}
2025-12-08 15:10:24.408 ERROR util/utils.go:263 Handle HTTP response {"status": 400, "request URL": "http://localhost:1080/external-cert/http1/NSX_MANAGER3/443/api/v1/reverse-proxy/node/health", "response body": "Destination server data is invalid", "error": "received HTTP Bad Request Error"}
2025-12-08 15:10:24.408 ERROR nsx/endpoint.go:174 Failed to validate API cluster {"endpoint": "NSX_MANAGER3"}
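
To see which NSX Manager endpoints are failing validation, the endpoint names can be pulled out of these messages. This is a sketch under assumptions: the function name list_failed_nsx_endpoints is hypothetical, the message format is assumed to match the excerpt above, and the log files are assumed to be plain text (compressed rotations would need zgrep instead of grep).

```shell
#!/bin/bash
# Hypothetical helper: list the unique NSX endpoints that failed validation.
# Arguments: one or more log files, e.g. /var/log/vmware/wcp/stdstream.log.stdout*
list_failed_nsx_endpoints() {
    if [ "$#" -eq 0 ]; then
        echo "usage: list_failed_nsx_endpoints <log-file>..." >&2
        return 2
    fi
    # -h suppresses file-name prefixes when several files are searched.
    grep -h 'Failed to validate API cluster' "$@" \
        | sed -n 's/.*"endpoint": "\([^"]*\)".*/\1/p' \
        | sort -u
}
```

For example, list_failed_nsx_endpoints /var/log/vmware/wcp/stdstream.log.stdout* prints one endpoint name per line. If every NSX Manager node appears in the output, the problem is more likely on the network path from vCenter than on a single NSX node.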

Environment

vCenter Server 8.x/9.x

NSX 4.x/9.x

Cause

This issue occurs because vCenter Server cannot communicate with the NSX Manager nodes, so the NSX cleanup step of the Supervisor removal fails.

Resolution

Ensure that vCenter can communicate directly with the NSX Managers on port 443 and can resolve the NSX Manager names via DNS. Remove any firewall rules (such as NSX Distributed Firewall (DFW) rules) that may be blocking communication between vCenter and NSX.

After communication on port 443 from vCenter to NSX is unblocked, the Supervisor cluster will continue the cleanup process.

To verify that port 443 is open, open an SSH session to vCenter and run the following command, replacing NSX_Manager_IP with the address of each NSX Manager in turn:

curl -v telnet://NSX_Manager_IP:443
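
Because the resolution covers both DNS and port 443, both checks can be combined and repeated for each NSX Manager node. The following is a minimal sketch, not part of the product: the function name check_nsx_manager is hypothetical, and it assumes getent and curl are available in the vCenter shell. It uses an HTTPS request instead of the telnet:// form above so that the command returns on its own. The -k option skips certificate verification because only reachability is being tested.

```shell
#!/bin/bash
# Hypothetical helper: check DNS resolution and port 443 reachability for one
# NSX Manager. Returns 0 when both checks pass, 1 on failure, 2 on bad usage.
check_nsx_manager() {
    local host="${1:-}"
    if [ -z "$host" ]; then
        echo "usage: check_nsx_manager <nsx-manager-fqdn-or-ip>" >&2
        return 2
    fi
    # Step 1: name resolution (an IP literal also passes this step).
    if ! getent ahosts "$host" >/dev/null 2>&1; then
        echo "$host: DNS resolution failed"
        return 1
    fi
    # Step 2: any HTTP response over port 443 counts as reachable.
    if ! curl -k -s -o /dev/null --connect-timeout 5 --max-time 10 "https://$host:443/"; then
        echo "$host: port 443 unreachable"
        return 1
    fi
    echo "$host: DNS and port 443 OK"
}
```

A loop such as for h in NSX_MANAGER1 NSX_MANAGER2 NSX_MANAGER3; do check_nsx_manager "$h"; done checks all three nodes, where the three names are placeholders for the real NSX Manager FQDNs or IP addresses.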