Symptoms:
- Pods in Tanzu Kubernetes Grid Integrated Edition (TKGI) remain stuck in the "ContainerCreating" state after they are restarted.
- You see events similar to the following when you run kubectl describe pod against one of the affected pods:
Events:
Type     Reason                  Age                 From                                           Message
----     ------                  ----                ----                                           -------
Normal   Scheduled               <unknown>           default-scheduler                              Successfully assigned mynamespace/mypod
Warning  FailedCreatePodSandBox  53s (x13 over 49m)  kubelet, 659905c9-e7a7-4e18-ba3e-6f5781f0eb1f  Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Normal   SandboxChanged          52s (x13 over 49m)  kubelet, 659905c9-e7a7-4e18-ba3e-6f5781f0eb1f  Pod sandbox changed, it will be killed and re-created.
- You see messages similar to the following from the kubelet process in the kubelet.stderr.log file:
E0506 19:06:30.386181 10846 cni.go:364] Error adding mynamespace_mypod/33e1d23695349d615fdc9683bc81b6022379eb37907ede67412a0097cf3e69ac to network nsx/nsx-cni: netplugin failed with no error message
- Around the same time as the previous message, you see messages similar to the following from the hyperbus process in the nsx-syslog.log file on the ESXi host. These messages indicate that the hyperbus has lost its connection to the NSX Node Agent:
2021-05-06T19:06:21Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="DD079380" level="error" errorCode="RPC104"] RpcCall:Server:ServerStreaming[vmware.nsx.lcp.CifConfigService/SubscribeCifConfig, 0x0001, ERROR] Can't send a message in error state
### lines omitted for brevity ###
2021-05-06T19:06:22Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-net" tid="DD0FB700" level="warn"] StreamConnection[241074 Connecting to tcp://169.254.1.14:2345 sid:241074] Couldn't connect to 'tcp://169.254.1.14:2345' (error: 110-Connection timed out)
2021-05-06T19:06:22Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-net" tid="DD0FB700" level="warn"] StreamConnection[241074 Error to tcp://169.254.1.14:2345 sid:-1] Error 110-Connection timed out
2021-05-06T19:06:22Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="DD0FB700" level="warn"] RpcConnection[241074 Connecting to tcp://169.254.1.14:2345 0] Couldn't connect to tcp://169.254.1.14:2345 (error: 110-Connection timed out)
2021-05-06T19:06:22Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="DD0FB700" level="warn"] RpcTransport[0] Unable to connect to tcp://169.254.1.14:2345: 110-Connection timed out
2021-05-06T19:06:22Z cfgAgent[2101830]: NSX 2101830 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-rpc" tid="DD0FB700" level="info"] ConnectionKeeper[6 tcp://169.254.1.14:2345] scheduling connection attempt in 6000 ms
- In the same file, you see that the NSX Node Agent creates the VIF for the pod, but only after the CNI plugin has already failed:
2021-05-06T20:17:15.676Z 659905c9-e7a7-4e18-ba3e-6f5781f0eb1f NSX 25839 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.hyperbus_service Updated app nsx.mynamespace.mypod with IP 172.##.##.143/24, MAC 04:50:56:00:0c:bf, gateway 172.##.##.1/24, vlan 224, CIF 0344371f-f97a-49d5-82eb-2400971535b9, wait_for_sync False
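
To check collected logs against the symptoms above, a small script can count how often each signature appears. The following is a minimal sketch, not a supported tool: the regular expressions are derived only from the sample messages in this article, the labels (sandbox_timeout, cni_add_failed, hyperbus_timeout, late_vif_update) and function names are hypothetical, and the exact message wording may differ between TKGI and NSX versions.

```python
import re
import sys
from collections import Counter

# Signatures taken from the sample messages in this article.
# The labels are hypothetical names chosen for this sketch.
SIGNATURES = {
    # kubectl describe event: sandbox creation timed out
    "sandbox_timeout": re.compile(
        r"FailedCreatePodSandBox.*context deadline exceeded"),
    # kubelet.stderr.log: the NSX CNI plugin failed to add the pod
    "cni_add_failed": re.compile(
        r"Error adding \S+ to network nsx/nsx-cni: netplugin failed"),
    # nsx-syslog.log on the ESXi host: hyperbus cannot reach the node agent
    "hyperbus_timeout": re.compile(
        r"Couldn't connect to '?tcp://[\d.]+:\d+'? "
        r"\(error: 110-Connection timed out\)"),
    # NSX Node Agent log: VIF configuration for the pod arrives
    "late_vif_update": re.compile(
        r"nsx_ujo\.agent\.hyperbus_service Updated app \S+ with IP"),
}


def classify_line(line):
    """Return the label of the first signature found in line, or None."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(line):
            return label
    return None


def count_symptoms(lines):
    """Count how many lines match each signature."""
    counts = Counter()
    for line in lines:
        label = classify_line(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    # Usage: pass one or more log file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, dict(count_symptoms(handle)))
```

Running the script against copies of kubelet.stderr.log and nsx-syslog.log prints a per-file count for each signature. A match only shows that a line resembles the samples here; comparing the timestamps of the matching lines is still needed to confirm that the hyperbus timeouts and the CNI failure occurred around the same time.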