Pods deployed on worker nodes are stuck in Init status.
The issue may occur on only some worker nodes or on all of them.
Running kubectl describe pod against a pod stuck in Init status shows an event similar to:
Warning FailedCreatePodSandBox 36m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<SANDBOX_ID>": plugin type="nsx" failed (add): Failed to receive message header
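For example, the events above can be retrieved with a command similar to the following, where the pod name and namespace are placeholders:
kubectl describe pod <POD_NAME> -n <NAMESPACE>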
Pods restarted on the problem nodes do not come back up, failing with the same "failed to setup network for sandbox" error.
On the problem worker node, /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log contains errors such as:
2025-06-06T11:55:14.620Z 82bd3d06-####-####-####-66dde51c1240 NSX 10996 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.<NAMESPACE>.<POD_NAME>, network interface for it will not be configured
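As a quick check, a command such as the following can be run on a problem worker node to search for these entries (the search string is taken from the example message above and may differ in your environment):
grep "Unable to retrieve network info" /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log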
In this instance, the NCP logs on the cluster master nodes provided no additional detail about the failure.
Environment
VMware Tanzu Kubernetes Grid 1.x
NSX-T 3.2.x
NSX 4.x
Cause
The NSX CNI plugin is unable to provide network connectivity to new pods, preventing them from becoming operational.
This occurs when the NSX CCP service on one or all of the NSX manager nodes temporarily loses connectivity to the NSX Corfu database service.
Due to the disconnect, the CCP service cannot access the database tables needed to assign logical switch ports (LSPs) to TKGI pods.
Loss of connectivity between CCP and Corfu may be caused by a network connectivity issue between the NSX manager nodes or by an interruption in storage.
When network connectivity is restored, the connection between CCP and Corfu may not be re-established automatically.
NSX CCP automatically attempts to reconnect to Corfu, and after 20 attempts the system triggers a systemDownHandler event to assist in recovery and restart the CCP service.
However, there is a known issue with this systemDownHandler recovery in NSX 4.1 and earlier.
Resolution
The issue with automatic recovery via the systemDownHandler event is fixed in NSX-T 3.2.4 and NSX 4.2.0.
Workaround
When the issue is occurring, restart the ccp service on the NSX manager nodes.
SSH to the NSX manager nodes and run the following command as root to restart the ccp service:
/etc/init.d/nsx-ccp restart
Run the following command to confirm the service is active and running:
/etc/init.d/nsx-ccp status
Alternatively, you may restart the NSX manager nodes individually.
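After the ccp service has been restarted, recovery can be confirmed by verifying that the affected pods leave Init status, for example with a command such as:
kubectl get pods --all-namespaces | grep Init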
Additional Information
You may be able to determine whether the issue is occurring by checking the ccp logs on the NSX manager nodes (/var/log/cloudnet/nsx-ccp.log) for the following text:
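For example, a search similar to the following can be run on each NSX manager node, where <SEARCH_TEXT> is a placeholder for the log text referenced above:
grep "<SEARCH_TEXT>" /var/log/cloudnet/nsx-ccp.log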