TKGI pods stuck in Init status with NSX networking showing error "failed to setup network"

Article ID: 400755


Products

  • VMware Tanzu Kubernetes Grid Integrated Edition
  • VMware NSX

Issue/Introduction

  • Pods deployed on worker nodes are stuck in Init status.
  • The issue may occur on only some worker nodes or on all of them.
  • Running kubectl describe pod on a pod stuck in Init status shows:

    Warning  FailedCreatePodSandBox  36m                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "<SANDBOX_ID>": plugin type="nsx" failed (add): Failed to receive message header

  • Pods restarted on the affected nodes do not come back up, failing with the same "failed to setup network for sandbox" error.
  • On the affected worker node, the /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log shows errors like:

    2025-06-06T11:55:14.620Z 82bd3d06-####-####-####-66dde51c1240 NSX 10996 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.<NAMESPACE>.<POD_NAME>, network interface for it will not be configured

  • In this instance, the NCP logs on the cluster master nodes provided no additional detail about the failure. Example commands for confirming these symptoms are shown after this list.
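
The following commands are one way to confirm the symptoms above; the namespace and pod names are placeholders, the first two commands require kubectl access to the cluster, and the last is run on the affected worker node (for example, after connecting with bosh ssh):

kubectl get pods -A | grep Init
kubectl describe pod <POD_NAME> -n <NAMESPACE> | grep -A 3 FailedCreatePodSandBox
grep "Unable to retrieve network info" /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log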

Environment

  • VMware Tanzu Kubernetes Grid Integrated Edition 1.x
  • NSX-T 3.2.x
  • NSX 4.x

 

Cause

The NSX CNI plugin is unable to provide network connectivity to new pods, preventing them from becoming operational. 

  • This occurs when the NSX CCP (Central Control Plane) service on one or all NSX Manager nodes temporarily loses connectivity to the NSX Corfu database service.
  • Because of the disconnect, the CCP service cannot access the database tables needed to assign LSPs (logical switch ports) to TKGI pods.
  • Loss of connectivity between CCP and Corfu may be caused by a network connectivity issue between the NSX Manager nodes or by an interruption in storage.
  • When network connectivity is restored, the connection between CCP and Corfu may not be re-established automatically.
    • NSX CCP automatically attempts to reconnect to Corfu; after 20 failed attempts, the system invokes a systemDownHandler event to assist in recovery and restart the CCP service.
    • However, there is a known issue with the systemDownHandler recovery attempt in NSX 4.1 and earlier. A quick check of the control plane state is sketched after this list.
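
As a quick check of the control plane state (a sketch; the exact output format varies by NSX version), review the cluster status from the admin CLI of each NSX Manager node. The CONTROLLER group reflects the CCP service, and the DATASTORE group typically reflects the Corfu database service:

get cluster status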

Resolution

  • Automatic recovery with the systemDownHandler event is fixed in NSX-T 3.2.4 and NSX 4.2.0.

Workaround

  • When the issue is occurring, restart the CCP service on the NSX Manager nodes
    • SSH to each NSX Manager node and run the following command as root to restart the CCP service

/etc/init.d/nsx-ccp restart 

    • Run the following command to confirm the service is active and running

/etc/init.d/nsx-ccp status

  • You may also restart the NSX Manager nodes individually. A post-restart check is sketched below.
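
Once the CCP service is running again, a minimal recovery check (assuming the stuck pods are managed by a controller such as a Deployment or DaemonSet, so that deleted pods are recreated) is to delete one of the stuck pods and confirm that its replacement reaches Running status; the pod and namespace names are placeholders:

kubectl delete pod <POD_NAME> -n <NAMESPACE>
kubectl get pods -n <NAMESPACE> -w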

Additional Information

  • You may be able to monitor NSX logs to see if the issue is occurring by checking the CCP logs (/var/log/cloudnet/nsx-ccp.log) for the following text (an example command is shown below):

Invoking the systemDownHandler
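
For example, run the following as root on each NSX Manager node:

grep "Invoking the systemDownHandler" /var/log/cloudnet/nsx-ccp.log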