PKS 1.4 cluster creation failure, NCP node agent fails

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:
The apply-addons errand fails to complete during cluster creation on a PKS 1.4 installation on vSphere with NSX-T. The error output shows that cluster creation starts, but the NSX-T Container Plug-in (NCP) node agent on the worker node fails to start, leaving some of the kube-system pods stuck in the “ContainerCreating” state.

Environment

Cause

The NCP node agent on the worker node fails to start successfully and logs the following message:

“Cannot find OVS port for container 1d418d684a6b8eda37796ffb38e0be6dc19349b91fd348401c3a97d73b2dc048, skipped deleting”

Because the NCP node agent does not start, cluster creation fails.

On the underlying host, the connection between the node agent and HyperBus is reported as “Unhealthy”.

The error surfaces in the NSX node agent log on the worker VM, where nsx-node-agent failed to start:

[“NCP01004”] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured

Debug cluster creation failure

To debug cluster creation failures, use BOSH to inspect the VMs in the cluster job for failures.

1. Use the following BOSH command to list all of the PKS cluster VMs. Check to see if any of the VMs are in a failing state:

ubuntu@opsmgr-customer0-io:~$ BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-j8ln1AyVSsmE-QvCNwEy67jg6EWXU BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.1.11 bosh vms

Note: In this case, the worker VM is in a failing state.

2. SSH into the worker VM, which failed in this instance, using the following BOSH SSH command:

BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-j8ln1AyVSsmE-QvCNwEy67jg6EWXU BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.1.11 bosh -d service-instance_6675c28d-09f4-467d-8583-75e9b9d8a448 ssh  worker/206b453e-1869-4961-972a-27b53a3f763a

3. Change to the /var/vcap/sys/log/nsx-node-agent directory and inspect nsx-node-agent.stdout.log:

The log contains messages like the following:

1 2019-05-13T14:55:29.570Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container 1d418d684a6b8eda37796ffb38e0be6dc19349b91fd348401c3a97d73b2dc048, skipped deleting

1 2019-05-13T15:00:29.862Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured

1 2019-05-13T15:00:30.865Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container c495ecceb06272e56c7dd35a749d4a8a856f2ae6960328cdbccb912fc530f56b, skipped deleting

1 2019-05-13T15:05:31.153Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured

1 2019-05-13T15:05:32.156Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="INFO"] nsx_ujo.agent.cni_watcher Cannot find OVS port for container 8d382a636aa37062a38e3d41c109bb401159ef657d38d03d30efc2af0930253e, skipped deleting

1 2019-05-13T15:10:32.450Z 251f9a5c-bba2-43b6-a520-5b208a00c4bc NSX 13943 - [nsx@6876 comp="nsx-container-node" subcomp="nsx_node_agent" level="ERROR" errorCode="NCP01004"] nsx_ujo.agent.cni_watcher Unable to retrieve network info for container nsx.kube-system.coredns-54586579f6-fqjh6, network interface for it will not be configured

These messages show that nsx-node-agent failed to start on the worker VM: the NSX node agent is unable to retrieve network information for the nsx.kube-system.coredns pod.
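When the log is long, it can help to count the NCP01004 errors per container to see which pods are affected. The following is a minimal sketch and not part of the product tooling: the function name summarize_ncp01004 is hypothetical, and it assumes the error lines follow the format shown in the excerpt above.

```shell
# Hypothetical helper: count NCP01004 errors per container in a node-agent log.
# Assumes each error line has the form seen in nsx-node-agent.stdout.log:
#   ... errorCode="NCP01004"] ... network info for container <name>, network interface ...
summarize_ncp01004() {
  grep 'errorCode="NCP01004"' "$1" \
    | sed -n 's/.*network info for container \([^,]*\),.*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}
```

For example, summarize_ncp01004 /var/vcap/sys/log/nsx-node-agent/nsx-node-agent.stdout.log prints one line per container name, followed by the number of NCP01004 errors logged for it.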

4. Start the NSX CLI on the worker node and execute the following commands:

sudo chroot /var/vcap/data/nsx-node-agent/rootfs/ nsxcli

NSX CLI (Node Agent). Press ? for command list or enter: help

8f0b9f7a-cba7-44a3-a39c-2a46dac5d538> get

 container-cache       Container cache

 container-caches      All container caches

 file                  File

 files                 Files

 node-agent-hyperbus   Connection status between node-agent and hyperbus

 node-agent-log-level  Node-agent log level

 version               System version

 
8f0b9f7a-cba7-44a3-a39c-2a46dac5d538> get node-agent-hyperbus status

HyperBus status: Unhealthy

Running kubectl get pods -o wide --all-namespaces against the Kubernetes cluster shows the affected pods stuck in the “ContainerCreating” state.
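To narrow the kubectl output down to the affected pods, the STATUS column can be filtered. This is a rough sketch only: the function name stuck_pods is hypothetical, and it assumes the default column order that kubectl prints with --all-namespaces (NAMESPACE, NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output on stdin
# and print namespace/name for every pod whose STATUS is ContainerCreating.
# Assumes the default column order: NAMESPACE NAME READY STATUS ...
stuck_pods() {
  # NR > 1 skips the header row; $4 is the STATUS column.
  awk 'NR > 1 && $4 == "ContainerCreating" { print $1 "/" $2 }'
}
```

It can be used as: kubectl get pods -o wide --all-namespaces | stuck_pods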

Resolution

To resolve this issue, restart the netcpad service on the underlying VMware ESXi host with the following command:

/etc/init.d/netcpad restart