While creating or upgrading clusters in TKGI, BOSH runs the apply-addons errand, which deploys coredns and the other add-ons required by TKGI. The errand rolls out a coredns Deployment object at the Kubernetes level, and this rollout fails if there is an error or misconfiguration. This procedure details the steps to find the root cause of this type of failure.
All Versions of VMware Tanzu Kubernetes Grid Integrated Edition
A few common causes of this type of failure are listed below. This is not an exhaustive list; depending on the environment configuration, there may be scenarios that need further troubleshooting.
When this issue occurs, the user encounters an error similar to the following:
"coredns\" created
Waiting for rollout to finish: 0 of 1 updated replicas are available...
failed to start all system specs after 1200 with exit code
When cluster creation fails, find the BOSH task that failed using the following command:
tkgi cluster one_worker
Name: one_worker
Plan Name: small
UUID: ########-#####-####-#####-##########
Last Action: CREATE
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: ########-#####-####-#####-##########, broker-request-id: ########-#####-####-#####-##########, task-id: 1667, operation: create
Kubernetes Master Host: one_worker
Kubernetes Master Port: 8443
Worker Nodes: 1
Kubernetes Master IP(s): In Progress
The BOSH task fails with "failed to start all system specs after 1200 with exit code 1".
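The task-id shown in the Last Action Description (1667 in this example) identifies the failed BOSH task. As a sketch, assuming the BOSH CLI is logged in to the TKGI control plane director, the recent tasks for the affected service instance can also be listed directly:
bosh -d service-instance_<UUID> tasks --recent=20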
When a cluster upgrade fails, the BOSH task that failed can be found in the Apply Changes
log from the Operations (Ops) Manager UI.
[upgrade-all-service-instances] 2018/08/10 ##:##:##.####### FINISHED UPGRADES Status: FAILED; Summary: Number of successful upgrades: 0; Number of CF service instance orphans detected: 0; Number of deleted instances before upgrade could occur: 0; Number of busy instances which could not be upgraded: 0; Number of service instances that failed to upgrade: 1 [########-#####-####-#####-##########]
[upgrade-all-service-instances] 2018/08/10 ##:##:##.###### [########-#####-####-#####-##########] Upgrade failed: bosh task id 149: Failed for bosh task: 165
Once the BOSH task is identified, the command bosh task <task-id> --debug can be used to get a better understanding of the cause of the error. In the debug task logs, look for the error message below, which indicates that the coredns deployment rollout has failed.
{"time":1531003250,"stage":"Fetching logs for apply-addons/########-#####-####-#####-########## (0)","tags":[],"total":1,"task":"Finding and packing log files","index":1,"state":"finished","progress":100}
', "result_output" = '{"instance":{"group":"apply-addons","id":"########-#####-####-#####-##########"},"errand_name":"apply-addons","exit_code":1,
"stdout":"Deploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nservice \"kube-dns\" created\nserviceaccount \"coredns\" created\nconfigmap \"coredns\" created\nconfigmap \"coredns\" created\ndeployment.extensions \"coredns\" created
Waiting for rollout to finish: 0 of 1 updated replicas are available...\n failed to start all system specs after 1200 with exit code 1\n",
"stderr":"error: deployment \"coredns\" exceeded its progress deadline\n","logs":{"blobstore_id":"########-#####-####-#####-##########","sha1":"################################"}}
Although the cluster creation or upgrade has failed, tkgi get-credentials <cluster-name> will still work because the Kubernetes API server has started successfully. After running this command, the kubectl CLI can be used to troubleshoot the deployment rollout failure.
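For example, the following sequence is a minimal sketch, assuming the cluster is named one_worker as above and that the kubeconfig context created by TKGI matches the cluster name:
tkgi get-credentials one_worker
kubectl config use-context one_worker
kubectl get pods -n kube-system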
If for any reason tkgi get-credentials
is not successful during a failed cluster operation, an alternate way to get access to the kubectl CLI is by following the process detailed here.
To find out which pods are failing, use the command kubectl get pods -o wide --all-namespaces.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system coredns-#######-##### 0/3 CrashLoopBackOff 28 31m ###.##.##.# ########-#####-####-#####-##############
The NODE column in the output above identifies which worker VM is hosting the failed pod, and therefore where its logs are located. Use the output of kubectl get nodes to map the NODE value above to a node NAME below.
NAME STATUS ROLES AGE VERSION
########-#####-####-#####-########## Ready <none> 7d v1.18.1
########-#####-####-#####-########## Ready <none> 17d v1.18.1
########-#####-####-#####-########## Ready <none> 17d v1.18.1
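If the node name alone is not enough to identify the corresponding BOSH worker VM, one approach (a sketch; the deployment name follows the service-instance_<UUID> pattern used later in this article) is to match the node's internal IP against the BOSH instance list:
kubectl get nodes -o wide
bosh vms -d service-instance_<UUID>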
Once the worker node hosting the failed pod has been identified, there are a few ways to get the necessary information to trace the failure.
The logs are present at two locations on the worker VM: /var/log/pods and /var/log/containers. They point to the same logs but are aggregated at the pod and container level respectively; /var/log/containers has a more readable file naming convention.
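As an illustration, here is a sketch of inspecting these logs directly over bosh ssh, assuming the worker instance identified above and a placeholder coredns log file name:
bosh ssh -d service-instance_<UUID> worker/<node-id>
sudo ls /var/log/containers | grep coredns
sudo tail -n 50 /var/log/containers/<coredns-container-log-file>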
Alternatively, the kubectl CLI can be used to debug pod failures. Refer to debugging pods for a more exhaustive list of information to collect. However, the following two commands should provide enough information to get started.
Describe current pod
kubectl describe pod coredns-########-##### -n kube-system
Logs for container inside coredns pod
kubectl logs -n kube-system -l k8s-app=kube-dns
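If the pod is in CrashLoopBackOff, as in the example above, the logs from the previously crashed container instance are often the most useful. A sketch, using the pod name from the earlier output:
kubectl logs -n kube-system coredns-########-##### --previous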
If the logs and troubleshooting so far have not yielded enough information to resolve the problem, it is necessary to dive a little deeper. The rollout failure is most likely because the underlying CNI layer is not healthy. Depending on which CNI plugin is in use, follow the appropriate steps below:
1. If NSX Container Plug-In (NCP) is configured on a High Availability (HA) setup, determine the NCP master using the command below.
Note: If this is a single master cluster, proceed to the next step.
bosh ssh -d service-instance_<UUID> master -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" | grep "This instance is the NCP master"
2. Check the NCP logs on the master VM identified in the step above (an example of tailing these logs follows these steps).
master/########-#####-####-#####-##########:~# cd /var/vcap/sys/log/ncp
-rw-r--r-- 1 root root 47M Mar 28 07:47 ncp.stdout.log
-rw-r--r-- 1 root root 3.1M Mar 30 19:06 ncp.stderr.log
3. Check if the hyperbus is unhealthy on the worker nodes.
bosh ssh -d service-instance_<UUID> worker -c "sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status" | grep -i unhealthy
4. SSH to the worker node identified above and check the nsx-node-agent logs (see the example after these steps).
worker/<node-name>:~# cd /var/vcap/sys/log/nsx-node-agent/
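To review the logs referenced in steps 2 and 4, the most recent entries can be checked as in the sketch below. The ncp.* file names are taken from the listing above; the nsx-node-agent file names can vary by version, so the directory is listed first:
# On the NCP master identified in step 1:
sudo tail -n 100 /var/vcap/sys/log/ncp/ncp.stderr.log
# On the worker node hosting the coredns pod (step 4):
sudo ls -lrt /var/vcap/sys/log/nsx-node-agent/
sudo tail -n 100 /var/vcap/sys/log/nsx-node-agent/*.stderr.log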
1. Check that flanneld is running on the master and worker VMs.
bosh is --ps -d service-instance_<UUID> | grep flanneld
master/########-#####-####-#####-##########   flanneld   running   -   -
worker/########-#####-####-#####-##########   flanneld   running   -   -
worker/########-#####-####-#####-##########   flanneld   running   -   -
2. Check the logs on the master or worker VM where flanneld is failing. Also, check the flanneld logs on the worker VM where the coredns pod is deployed (see the example after these steps).
cd /var/vcap/sys/log/flanneld
ls -lrth
total 24K
-rw-r--r-- 1 vcap vcap    0 Mar 21 16:45 flanneld.stdout.log
-rw-r--r-- 1 vcap vcap  726 Mar 29 19:13 flanneld_ctl.stdout.log
-rw-r--r-- 1 vcap vcap 8.3K Mar 29 19:13 flanneld_ctl.stderr.log
-rw-r--r-- 1 vcap vcap 7.6K Mar 30 18:13 flanneld.stderr.log
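As a quick first pass on the flanneld logs listed above, the stderr log (file name taken from the listing) can be searched for recent errors; a sketch:
grep -iE "error|fail" /var/vcap/sys/log/flanneld/flanneld.stderr.log | tail -n 20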
The above process will help discover the cause of the coredns rollout failure. There can be multiple scenarios, in both NSX-T and flannel environments, that cause these failures. Detailing all of them is out of scope for this article, but this article will act as a master list for such scenarios. Scenarios in which the coredns deployment rollout has failed in the past are referenced below.