While creating or upgrading clusters in TKGI, BOSH runs the apply-addons errand, which deploys coredns and the other add-ons required by TKGI. The errand rolls out a coredns Deployment object at the Kubernetes level, and this rollout fails if there is an error or misconfiguration. This procedure details the steps to find the root cause of this type of failure.
All Versions of VMware Tanzu Kubernetes Grid Integrated Edition
A few common causes of this type of failure are listed below. This is not an exhaustive list; depending on the environment configuration, there may be scenarios that need further troubleshooting.
When this issue occurs, the user encounters an error similar to the following:
"coredns\" created
Waiting for rollout to finish: 0 of 1 updated replicas are available...
failed to start all system specs after 1200 with exit code
When cluster creation fails, find the BOSH task that failed using the following command:
tkgi cluster one_worker
Name: one_worker
Plan Name: small
UUID: ########-#####-####-#####-##########
Last Action: CREATE
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: ########-#####-####-#####-##########, broker-request-id: ########-#####-####-#####-##########, task-id: 1667, operation: create
Kubernetes Master Host: one_worker
Kubernetes Master Port: 8443
Worker Nodes: 1
Kubernetes Master IP(s): In Progress
The BOSH task fails with "failed to start all system specs after 1200 with exit code 1".
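The task-id shown in the Last Action Description (1667 in this example) identifies the failed BOSH task. As a sketch, assuming the BOSH CLI is logged in to the TKGI control plane director, the recent tasks for the affected service instance can also be listed directly:
bosh -d service-instance_<UUID> tasks --recent=20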
When a cluster upgrade fails, the BOSH task that failed can be found in the Apply Changes
log from the Operations (Ops) Manager UI.
[upgrade-all-service-instances] 2018/08/10 ##:##:##.####### FINISHED UPGRADES Status: FAILED; Summary: Number of successful upgrades: 0; Number of CF service instance orphans detected: 0; Number of deleted instances before upgrade could occur: 0; Number of busy instances which could not be upgraded: 0; Number of service instances that failed to upgrade: 1 [########-#####-####-#####-##########]
[upgrade-all-service-instances] 2018/08/10 ##:##:##.###### [########-#####-####-#####-##########] Upgrade failed: bosh task id 149: Failed for bosh task: 165
Once the BOSH task is identified, the command bosh task <task-id> --debug can be used to get a better understanding of the cause of the error. In the debug task logs, look for the error message below, which indicates that the coredns deployment rollout has failed.
{"time":1531003250,"stage":"Fetching logs for apply-addons/########-#####-####-#####-########## (0)","tags":[],"total":1,"task":"Finding and packing log files","index":1,"state":"finished","progress":100}
', "result_output" = '{"instance":{"group":"apply-addons","id":"########-#####-####-#####-##########"},"errand_name":"apply-addons","exit_code":1,
"stdout":"Deploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nservice \"kube-dns\" created\nserviceaccount \"coredns\" created\nconfigmap \"coredns\" created\nconfigmap \"coredns\" created\ndeployment.extensions \"coredns\" created
Waiting for rollout to finish: 0 of 1 updated replicas are available...\n failed to start all system specs after 1200 with exit code 1\n",
"stderr":"error: deployment \"coredns\" exceeded its progress deadline\n","logs":{"blobstore_id":"########-#####-####-#####-##########","sha1":"################################"}}
Although the cluster creation or upgrade has failed, tkgi get-credentials <cluster-name> will still work because the Kubernetes API server has started successfully. After running this command, the kubectl CLI can be used to troubleshoot the deployment rollout failure.
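For example, the following sequence is a minimal sketch, assuming the cluster is named one_worker as above and that the kubeconfig context created by TKGI matches the cluster name:
tkgi get-credentials one_worker
kubectl config use-context one_worker
kubectl get pods -n kube-system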
If for any reason tkgi get-credentials
is not successful during a failed cluster operation, an alternate way to get access to the kubectl CLI is by following the process detailed here.
To find out which pods are failing, use the command kubectl get pods -o wide --all-namespaces.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
kube-system coredns-#######-##### 0/3 CrashLoopBackOff 28 31m ###.##.##.# ########-#####-####-#####-##############
The NODE column in the output above identifies which worker VM is hosting the failed pod, and therefore where its logs are located. Use the output of kubectl get nodes to map the NODE value above to a node NAME below.
NAME STATUS ROLES AGE VERSION
########-#####-####-#####-########## Ready <none> 7d v1.18.1
########-#####-####-#####-########## Ready <none> 17d v1.18.1
########-#####-####-#####-########## Ready <none> 17d v1.18.1
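If the node name alone is not enough to identify the corresponding BOSH worker VM, one approach (a sketch; the deployment name follows the service-instance_<UUID> pattern used later in this article) is to match the node's internal IP against the BOSH instance list:
kubectl get nodes -o wide
bosh vms -d service-instance_<UUID>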
Once the worker node hosting the failed pod has been identified, there are a few ways to get the necessary information to trace the failure.
The logs are present at two locations on the worker VM: /var/log/pods and /var/log/containers. They point to the same logs but are aggregated at the pod and container level respectively; /var/log/containers has a more readable file naming convention.
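As an illustration, here is a sketch of inspecting these logs directly over bosh ssh, assuming the worker instance identified above and a placeholder coredns log file name:
bosh ssh -d service-instance_<UUID> worker/<node-id>
sudo ls /var/log/containers | grep coredns
sudo tail -n 50 /var/log/containers/<coredns-container-log-file>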
Alternatively, the kubectl CLI can be used to debug pod failures. Refer to debugging pods for a more exhaustive list of information to collect. However, the following two commands should provide enough information to get started.
Describe current pod
kubectl describe pod coredns-########-##### -n kube-system
Logs for container inside coredns pod
kubectl logs -n kube-system -l k8s-app=kube-dns
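If the pod is in CrashLoopBackOff, as in the example above, the logs from the previously crashed container instance are often the most useful. A sketch, using the pod name from the earlier output:
kubectl logs -n kube-system coredns-########-##### --previous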
If the logs and troubleshooting so far have not yielded enough information to resolve the problem, it is necessary to dive a little deeper. The rollout failure is most likely because the underlying CNI layer is not healthy. Depending on which CNI plugin is in use, follow the appropriate steps below:
1. If NSX Container Plug-In (NCP) is configured on a High Availability (HA) setup, determine the NCP master using the command below.
Note: If this is a single master cluster, proceed to the next step.
bosh ssh -d service-instance_<UUID> master -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" | grep "This instance is the NCP master"
2. Check the NCP logs on the master VM identified in the step above (an example of tailing these logs follows these steps).
master/########-#####-####-#####-##########:~# cd /var/vcap/sys/log/ncp
-rw-r--r-- 1 root root 47M Mar 28 07:47 ncp.stdout.log
-rw-r--r-- 1 root root 3.1M Mar 30 19:06 ncp.stderr.log
3. Check if the hyperbus is unhealthy on the worker nodes.
bosh ssh -d service-instance_<UUID> worker -c "sudo /var/vcap/jobs/nsx-node-agent/bin/nsxcli -c get node-agent-hyperbus status" | grep -i unhealthy
4. SSH to the worker node identified above and check the nsx-node-agent logs (see the example after these steps).
worker/<node-name>:~# cd /var/vcap/sys/log/nsx-node-agent/
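To review the logs referenced in steps 2 and 4, the most recent entries can be checked as in the sketch below. The ncp.* file names are taken from the listing above; the nsx-node-agent file names can vary by version, so the directory is listed first:
# On the NCP master identified in step 1:
sudo tail -n 100 /var/vcap/sys/log/ncp/ncp.stderr.log
# On the worker node hosting the coredns pod (step 4):
sudo ls -lrt /var/vcap/sys/log/nsx-node-agent/
sudo tail -n 100 /var/vcap/sys/log/nsx-node-agent/*.stderr.log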
1. Check that flanneld is running on the master and worker VMs.
bosh is --ps -d service-instance_<UUID> | grep flanneld
master/########-#####-####-#####-##########   flanneld   running   -   -
worker/########-#####-####-#####-##########   flanneld   running   -   -
worker/########-#####-####-#####-##########   flanneld   running   -   -
2. Check the logs on the master or worker VM where flanneld is failing. Also, check the flanneld logs on the worker VM where the coredns pod is deployed (see the example after these steps).
cd /var/vcap/sys/log/flanneld
ls -lrth
total 24K
-rw-r--r-- 1 vcap vcap    0 Mar 21 16:45 flanneld.stdout.log
-rw-r--r-- 1 vcap vcap  726 Mar 29 19:13 flanneld_ctl.stdout.log
-rw-r--r-- 1 vcap vcap 8.3K Mar 29 19:13 flanneld_ctl.stderr.log
-rw-r--r-- 1 vcap vcap 7.6K Mar 30 18:13 flanneld.stderr.log
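As a quick first pass on the flanneld logs listed above, the stderr log (file name taken from the listing) can be searched for recent errors; a sketch:
grep -iE "error|fail" /var/vcap/sys/log/flanneld/flanneld.stderr.log | tail -n 20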
The above process will help discover the cause of the coredns rollout failure. There can be multiple scenarios, in both NSX-T and flannel environments, that cause these failures. Detailing all of them is out of scope for this article, but this article will act as a master list for such scenarios. Scenarios in which the coredns deployment rollout has failed in the past are referenced below.