TKGI pods stuck in Init status with NSX networking showing error "failed to setup network"

Article ID: 400755


Products

  • VMware Tanzu Kubernetes Grid Integrated Edition
  • VMware NSX

Issue/Introduction

  • Pods deployed on worker nodes are stuck in Init status.
  • The issue may occur on only some worker nodes or on all of them.
  • Running kubectl describe pod on a pod stuck in Init status shows:

    Warning  FailedCreatePodSandBox  36m                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<SANDBOX_ID>": plugin type="nsx" failed (add): Failed to receive message header

  • Pods restarted on the affected nodes do not come back up, failing with the same "failed to setup network for sandbox" error.
  • On the affected worker node, the /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log shows errors like:

    2025-06-06T11:55:14.620Z 82bd3d06-####-####-####-66dde51c1240 NSX 10996 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.<NAMESPACE>.<POD_NAME>, network interface for it will not be configured

  • In this instance, the NCP logs on the cluster master nodes provided no additional detail about the failure. Example commands for confirming these symptoms are shown after this list.
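
The following commands are one way to confirm the symptoms above; the namespace and pod names are placeholders, the first two commands require kubectl access to the cluster, and the last is run on the affected worker node (for example, after connecting with bosh ssh):

kubectl get pods -A | grep Init
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A 3 FailedCreatePodSandBox
grep "Unable to retrieve network info" /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log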

Environment

  • VMware Tanzu Kubernetes Grid Integrated Edition 1.x
  • NSX-T 3.2.x
  • NSX 4.x

 

Cause

The NSX CNI plugin is unable to provide network connectivity to new pods, preventing them from becoming operational. 

  • This occurs when the NSX CCP (Central Control Plane) service on one or all NSX Manager nodes temporarily loses connectivity to the NSX Corfu database service.
  • Because of the disconnect, the CCP service cannot access the database tables needed to assign LSPs (logical switch ports) to TKGI pods.
  • Loss of connectivity between CCP and Corfu may be caused by a network connectivity issue between the NSX Manager nodes or by an interruption in storage.
  • When network connectivity is restored, the connection between CCP and Corfu may not be re-established automatically.
    • NSX CCP automatically attempts to reconnect to Corfu; after 20 failed attempts, the system invokes a systemDownHandler event to assist in recovery and restart the CCP service.
    • However, there is a known issue with the systemDownHandler recovery attempt in NSX 4.1 and earlier. A quick check of the control plane state is sketched after this list.
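
As a quick check of the control plane state (a sketch; the exact output format varies by NSX version), review the cluster status from the admin CLI of each NSX Manager node. The CONTROLLER group reflects the CCP service, and the DATASTORE group typically reflects the Corfu database service:

get cluster status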

Resolution

  • Automatic recovery with the systemDownHandler event is fixed in NSX-T 3.2.4 and NSX 4.2.0.

Workaround

  • When the issue is occurring, restart the CCP service on the NSX Manager nodes
    • SSH to each NSX Manager node and run the following command as root to restart the CCP service

/etc/init.d/nsx-ccp restart 

    • Run the following command to confirm the service is active and running

/etc/init.d/nsx-ccp status

  • You may also restart the NSX Manager nodes individually. A post-restart check is sketched below.
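
Once the CCP service is running again, a minimal recovery check (assuming the stuck pods are managed by a controller such as a Deployment or DaemonSet, so that deleted pods are recreated) is to delete one of the stuck pods and confirm that its replacement reaches Running status; the pod and namespace names are placeholders:

kubectl delete pod <POD_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -w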

Additional Information

  • You may be able to monitor NSX logs to see if the issue is occurring by checking the CCP logs (/var/log/cloudnet/nsx-ccp.log) for the following text (an example command is shown below):

Invoking the systemDownHandler
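
For example, run the following as root on each NSX Manager node:

grep "Invoking the systemDownHandler" /var/log/cloudnet/nsx-ccp.log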