PKS cluster creation failed with kube-dns errors.
In the bosh task xx --debug logs, you see the entries similar to:
deployment.extensions/kube-dns created
Waiting for deployment \"kube-dns\" rollout to finish: 0 of 1 updated replicas are available...
failed to start all system specs after 1200 with exit code 1\n","stderr":"error: deployment \"kube-dns\" exceeded its progress deadline\n","logs":{"blobstore_id":"7bf95a81-a6ce-4149-5111-4e9677ea405b","sha1":"e43a300328d23ab2395c4a36e75ee237b8f2de00"}}
While creating or upgrading clusters on PKS, BOSH runs apply-addons errand which deploys a bunch of add-ons and kube-dns. The kube-dns job rolls out a kube-dns deployment object at Kubernetes level. The kube-dns deployment fails if there is an error or misconfiguration in the underlying overlay networking solution.
To resolve this issue, verify the below options. Please note that these are the possible but not limited options, which cause the cluster creation failure. You need to check the logs and troubleshoot further.
Validate that there are no conflicting IP ranges present under IP Pools, when you go to Inventory from NSX-T Manager
Ensure that you have not used underscore in --external-hostname parameter when you run pks create cluster command.
Validate the NSX-T components health status from NSX-T Manager and confirm all the components are in healthy state.
Validate that all components are set for the NSX-T required configuration MTU size of 1600, including the DVS switches and physical network.
Verifiy that hyperbus interface exists on ESXi hosts by running the command:
esxcfg-vmknic -l
Verify the hyperbus agent health status on the ESXi host by following:
Connect to ESXi as root and run nsxcli to switch to nsxcli.
Run get hyperbus connection info to check the hyperbus health status.
Verify that you have selected the Enable Post Deploy Scripts option under Bosh tile, which will run some scripts to deploy K8S add-ons.
Log in to K8S master & worker nodes and confirm that all services are running:
Run bosh vms to get the deployment and node vm details.
Run bosh ssh <vm-name> -d <deployment-name> to connect to master/worker node.
Run sudo -i to switch to sudo mode.
Run monit summary to check the status of all services.
Review the /var/vcap/sys/log/ncp/ncp.stdout.log on master node to see if there are any network connectivity issues.
Review the below logs on worker nodes :
kubelet.*.log available under /var/vcap/sys/log/kubelet.
nsx-node-agent.*.log available under /var/vcap/sys/log/nsx-node-agent.
For more information, refer the below Pivotal articles:
How to troubleshoot kube-dns rollout failure when apply addons errand fails to start all system specs after 1200s
PKS-cluster-creation-fails-with-underscore
PKS-cluster-creation-in-NSX-T-fails-conflicting-cidr
PKS-cluster-creation-in-NSX-T-fails-as-all-available-IPs-in-external-SNAT-IP-pools-are-exhausted