The NCP node-agent plugin on the worker node fails to start successfully, resulting in some of the kube-system pods becoming stuck in the "ContainerCreating" state.
When the plugin fails to start, the following error message is logged:

"Cannot find OVS port for container 1d418d684a6b8eda37796ffb38e0be6dc19349b91fd348401c3a97d73b2dc048, skipped deleting"

Because the NCP node-agent plugin does not start, cluster creation fails.
The connection status between the node agent and the hyperbus process on the underlying host is reported as "Unhealthy".
The error also appears in the NSX node agent log on the worker VM that failed to start the nsx-node-agent:

["NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured
To debug a cluster creation failure, use BOSH to inspect the cluster's VMs for failing instances.
1. Use the following BOSH command to list all of the PKS cluster VMs. Check to see if any of the VMs are in a failing state:
ubuntu@opsmgr-customer0-io:~$ BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-j8ln1AyVSsmE-QvCNwEy67jg6EWXU BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.1.11 bosh vms
Note: In this case, the worker VM is in a failing state.
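To avoid repeating the credentials on each command, you can export the same values once in your shell (a convenience sketch reusing the values from the command above; the BOSH CLI reads these environment variables automatically):

export BOSH_CLIENT=ops_manager
export BOSH_CLIENT_SECRET=eh-j8ln1AyVSsmE-QvCNwEy67jg6EWXU
export BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate
export BOSH_ENVIRONMENT=192.168.1.11
bosh vms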
2. SSH into the worker VM that failed, using the following BOSH command:
BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-j8ln1AyVSsmE-QvCNwEy67jg6EWXU BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.1.11 bosh -d service-instance_6675c28d-09f4-467d-8583-75e9b9d8a448 ssh worker/206b453e-1869-4961-972a-27b53a3f763a
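If you do not know the deployment name for the cluster, you can list all deployments first; PKS cluster deployments appear as service-instance_<UUID> (assuming the environment variables exported above):

bosh deployments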
3. Change to /var/vcap/sys/log/nsx-node-agent and inspect nsx-node-agent.stdout.log. You will see messages like the following:
1 2019-05-13T14:55:29.570Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container 1d418d684a6b8eda37796ffb38e0be6dc19349b91fd348401c3a97d73b2dc048, skipped deleting
1 2019-05-13T15:00:29.862Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured
1 2019-05-13T15:00:30.865Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container c495ecceb06272e56c7dd35a749d4a8a856f2ae6960328cdbccb912fc530f56b, skipped deleting
1 2019-05-13T15:05:31.153Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured
1 2019-05-13T15:05:32.156Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container 8d382a636aa37062a38e3d41c109bb401159ef657d38d03d30efc2af0930253e, skipped deleting
1 2019-05-13T15:10:32.450Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured

The messages above show that the worker VM failed to start the nsx-node-agent: the NSX node agent is unable to retrieve network information for the nsx.kube-system.coredns pod.
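Because this log grows quickly, one way to surface just these records is to filter on the error strings shown above (an illustrative grep; the file name may differ if the logs have rotated):

cd /var/vcap/sys/log/nsx-node-agent
grep -E 'NCP01004|Cannot find OVS port' nsx-node-agent.stdout.log | tail -n 20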
Next, check the connection between the node agent and the hyperbus. Chroot into the nsx-node-agent root filesystem and run nsxcli:

sudo chroot /var/vcap/data/nsx-node-agent/rootfs/ nsxcli
NSX CLI (Node Agent). Press ? for command list or enter: help
8f0b9f7a-cba7-44a3-a39c-2a46dac5d538> get
  container-cache        Container cache
  container-caches       All container caches
  file                   File
  files                  Files
  node-agent-hyperbus    Connection status between node-agent and hyperbus
  node-agent-log-level   Node-agent log level
  version                System version
8f0b9f7a-cba7-44a3-a39c-2a46dac5d538> get node-agent-hyperbus status
HyperBus status: Unhealthy
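On the NCP builds we have seen, nsxcli can also run a single command non-interactively; the -c form below is an assumption, so fall back to the interactive session above if your build rejects it:

sudo chroot /var/vcap/data/nsx-node-agent/rootfs/ nsxcli -c "get node-agent-hyperbus status"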
When you run kubectl get pods -o wide --all-namespaces in the Kubernetes cluster, the affected pods are stuck in the "ContainerCreating" state.
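In a busy cluster you can narrow the output to the stuck pods with a simple filter on the STATUS column:

kubectl get pods -o wide --all-namespaces | grep ContainerCreating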
To resolve this issue, restart the hyperbus channel on the underlying host. Specifically, restart the netcpad service on the VMware ESXi host with the following command:
/etc/init.d/netcpad restart
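After the restart, confirm that netcpad is running and re-check the HyperBus status from the worker VM; it should now report healthy (the exact "Healthy" string is our assumption, mirroring the "Unhealthy" output shown earlier), and the stuck pods should transition to "Running":

/etc/init.d/netcpad status
sudo chroot /var/vcap/data/nsx-node-agent/rootfs/ nsxcli -c "get node-agent-hyperbus status"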