vsphere-csi-controller pods and nsx-ncp pods are in CrashLoopBackOff: Failed to get or renew SAML HoK from STS due to failed DNS lookup for VC endpoint: [Errno -3] Lookup timed out

Products

VMware vSphere Kubernetes Service

Issue/Introduction

vsphere-csi-controller pods and nsx-ncp pods are in CrashLoopBackOff

# kubectl get pods -A | grep -v Run
NAMESPACE                                   NAME                                                              READY   STATUS             RESTARTS           AGE
vmware-system-csi                           vsphere-csi-controller-xxxxxx-xxxx                            3/7     CrashLoopBackOff   2602 (3m12s ago)   44h
vmware-system-csi                           vsphere-csi-controller-xxxxxx-xxxx                            3/7     CrashLoopBackOff   2581 (32s ago)     44h
vmware-system-csi                           vsphere-csi-controller-xxxxxx-xxxx                            6/7     CrashLoopBackOff   653 (75s ago)      44h
vmware-system-nsx                           nsx-ncp-xxxxxx-xxxx                                           0/2     CrashLoopBackOff   4090 (2m2s ago)    11d
vmware-system-nsx                           nsx-ncp-xxxxxx-xxxx                                           0/2     CrashLoopBackOff   1560 (13s ago)     44h

The vsphere-csi-controller and nsx-ncp logs show that the connection to vCenter (VC) cannot be re-established because DNS lookups time out:

csi-controller log:

2025-02-19T03:54:08.233512036Z stderr F {"level":"error","time":"2025-02-19T03:54:08.233469274Z","caller":"wcp/controller.go:286","msg":"failed to re-establish VC connection. Will retry again in 60 seconds. err: failed to connect to VirtualCenter host: \"vc-example.in.co\", Err: Post \"https://vc-example.in.co:443/sdk\": dial tcp: lookup vc-example.in.co on 127.0.0.53:53: read udp 127.0.0.1:37179->127.0.0.53:53: i/o timeout","TraceId":"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service/wcp.(*controller).Init.func2\n\t/build/mts/release/bora-23905383/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/csi/service/wcp/controller.go:286"}
2025-02-19T03:54:08.233534074Z stderr F {"level":"info","time":"2025-02-19T03:54:08.2333685Z","caller":"vsphere/virtualcenter.go:384","msg":"Reloading latest VC config from vSphere Config Secret for vcenter: \"vc-example.in.co\"","TraceId":"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"}
2025-02-19T03:54:08.233782435Z stderr F {"level":"info","time":"2025-02-19T03:54:08.233738354Z","caller":"vsphere/utils.go:259","msg":"Defaulting timeout for vCenter Client to 5 minutes","TraceId":"xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"}

nsx-ncp log:

[ncp GreenThread-1 I] nsx_ujo.ncp.k8s.kubernetes HTTP session did not have a 'Content-type' header
[ncp GreenThread-1 I] nsx_ujo.ncp.k8s.kubernetes HTTP session did not have a 'Content-type' header
[ncp MainThread I] nsx_ujo.ncp.vc.session Refreshing token and re-instantiating TESSession
[ncp MainThread I] nsx_ujo.ncp.vc.session Retrieving VC Credentials for the first time
[ncp GreenThread-1 I] nsx_ujo.ncp.k8s.kubernetes HTTP session did not have a 'Content-type' header
[ncp MainThread W] nsx_ujo.ncp.vc.session Failed to get JWT token: Failed SAML HoK request: Failed to get or renew SAML HoK from STS due to failed DNS lookup for VC endpoint: [Errno -3] Lookup timed out., will retry after 120 seconds
[ncp GreenThread-1 I] nsx_ujo.ncp.k8s.kubernetes HTTP session did not have a 'Content-type' header
[ncp GreenThread-1 I] nsx_ujo.ncp.k8s.kubernetes HTTP session did not have a 'Content-type' header
[ncp GreenThread-1 I] nsx_ujo.ncp.election Seqno expired for master xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
[ncp GreenThread-1 I] nsx_ujo.ncp.election Instance xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx elected master.
Session terminated, terminating shell...[ncp MainThread W] nsx_ujo.ncp.main Receive signal for handling 15
[ncp MainThread W] nsx_ujo.ncp.main Main process is exiting, terminate election process!
 ...terminated.
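
To confirm that DNS is the common factor in both components, the saved logs can be scanned for the failure signatures shown above. The following is a minimal sketch, not an official tool: the function name `count_dns_errors` and the log file names are hypothetical, and the patterns are taken only from the excerpts in this article, so other builds may word the messages differently.

```shell
#!/usr/bin/env bash
# Count log lines that look like DNS lookup failures.
# Patterns come from the log excerpts in this article (assumption: other
# releases may use different wording).
count_dns_errors() {
  local log_file="$1"
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
  # keeps the function usable in scripts that run with "set -e".
  grep -c -E 'lookup .* i/o timeout|failed DNS lookup|Lookup timed out' "$log_file" || true
}

# Example usage (hypothetical files holding saved "kubectl logs" output):
# count_dns_errors csi-controller.log
# count_dns_errors nsx-ncp.log
```

A non-zero count in both logs suggests that the pods are failing for the same reason, rather than because of separate CSI and NCP problems.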

The VC and NSX Manager FQDNs cannot be resolved from the Supervisor:

# nslookup <vc-fqdn>
;; connection timed out; no servers could be reached

# nslookup <nsx-fqdn>
;; connection timed out; no servers could be reached

The DNS server is unreachable from the Supervisor:

# curl -vk telnet://<dns-IP>:53
* Trying <dns-IP>:53...
* connect to <dns-IP> port 53 failed: Connection timed out
* Failed to connect to <dns-IP> port 53 after 130279 ms: Couldn't connect to server
* Closing connection 0
curl: (28) Failed to connect to <dns-IP> port 53 after 130279 ms: Couldn't connect to server
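
The csi-controller log shows lookups going to 127.0.0.53, which is the address of the systemd-resolved stub listener, so the upstream DNS server may not be obvious from /etc/resolv.conf. The sketch below lists the configured nameservers and tests TCP port 53 on each. It rests on assumptions: the helper names `list_nameservers` and `check_tcp` are hypothetical, the node is assumed to have bash, awk, and coreutils `timeout`, and /run/systemd/resolve/resolv.conf is where systemd-resolved normally records upstream servers. Like the curl test above, this checks TCP only; resolvers query over UDP by default, so a successful TCP connection does not by itself prove that lookups will work.

```shell
#!/usr/bin/env bash
# Print the nameserver addresses found in a resolv.conf-style file.
list_nameservers() {
  local conf="${1:-/etc/resolv.conf}"
  awk '$1 == "nameserver" { print $2 }' "$conf"
}

# Return 0 if a TCP connection to host:port succeeds within wait_s seconds.
# Uses bash's /dev/tcp redirection; host and port are passed as arguments
# so they are not interpolated into the command string.
check_tcp() {
  local host="$1" port="$2" wait_s="${3:-5}"
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example usage (assumes systemd-resolved manages DNS on the node):
# for ns in $(list_nameservers /run/systemd/resolve/resolv.conf); do
#   if check_tcp "$ns" 53; then
#     echo "$ns: TCP/53 reachable"
#   else
#     echo "$ns: TCP/53 unreachable"
#   fi
# done
```

The same `check_tcp` helper can be pointed at the VC FQDN on port 443, the endpoint that appears in the csi-controller log, once name resolution is working again.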

Environment

VMware vSphere with Tanzu

Cause

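The Supervisor cannot reach its DNS server, so the VC and NSX Manager FQDNs do not resolve. This points to a network connectivity issue between the Supervisor and the DNS server, VC, or NSX.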

Resolution

Work with your internal network team to confirm that all required ports are open and to check whether any firewall rules block traffic from the Supervisor.

The Supervisor must be able to communicate with the domain controllers.

Verify that connectivity from the Supervisor to vCenter and the NSX Edge VMs works as expected.

Once the networking issues are resolved, the original issue will resolve automatically and the pods will return to the Running state.
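
Recovery can be confirmed by checking that no pods remain in a failed state. The following is a minimal sketch under stated assumptions: the function name `not_running_pods` is hypothetical, and it expects the column layout of `kubectl get pods -A` shown earlier in this article, where STATUS is the fourth column.

```shell
#!/usr/bin/env bash
# Read "kubectl get pods -A" output on stdin and print every pod whose
# STATUS is neither Running nor Completed, as "namespace/name status".
# Assumes the -A layout: NAMESPACE NAME READY STATUS RESTARTS AGE.
not_running_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}

# Example usage on the Supervisor:
# kubectl get pods -A | not_running_pods
```

Empty output means that no pod is reported in CrashLoopBackOff or another non-running status. Because the pods retry on a timer (60 seconds for the CSI controller and 120 seconds for NCP, per the logs above), it may take a few minutes after the network fix before the list clears.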

Additional Information

If required, engage the VMware by Broadcom networking team to investigate the connectivity issues.