PKS create-cluster fails with CoreDNS reporting "Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

Products

VMware Cloud PKS

Issue/Introduction

Symptoms:

When creating a PKS cluster from the command line, the creation fails and the cluster status is similar to the following:

pks cluster my-cluster

PKS Version: 1.6.0-build.17
Name: my-cluster
K8s Version: 1.15.5
Plan Name: small
UUID: 4956a7db-dc0a-44af-9b5c-98e20ab7b30f
Last Action: CREATE
Last Action State: failed
Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: 4956a7db-dc0a-44af-9b5c-98e20ab7b30f, broker-request-id: 0d65886f-34bf-42e5-88f7-593f1f5cca08, task-id: 237, operation: create, error-message: 0 succeeded, 1 errored, 0 canceled
Kubernetes Master Host: my-cluster
Kubernetes Master Port: 8443
Worker Nodes: 1
Kubernetes Master IP(s): In Progress
Network Profile Name:

You see messages similar to the following in the debug output for the related bosh task:

bosh task 237 --debug

{"time":1576772663,"stage":"Fetching logs for apply-addons/42ad20a4-d091-4277-aa03-cd2d5de05725 (0)","tags":[],"total":1,"task":"Finding and packing log files","index":1,"state":"finished","progress":100}
', "result_output" = '{"instance":{"group":"apply-addons","id":"42ad20a4-d091-4277-aa03-cd2d5de05725"},"errand_name":"apply-addons","exit_code":1,"stdout":"Deploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nserviceaccount/coredns created\nclusterrole.rbac.authorization.k8s.io/system:coredns created\nclusterrolebinding.rbac.authorization.k8s.io/system:coredns created\nconfigmap/coredns created\ndeployment.extensions/coredns created\nservice/kube-dns created\nWaiting for deployment \"coredns\" rollout to finish: 0 out of 3 new replicas have been updated...\nWaiting for deployment \"coredns\" rollout to finish: 0 of 3 updated replicas are available...\nfailed to start all system specs after 1200 with exit code 124\n","stderr":"","logs":{"blobstore_id":"2ac79808-5fce-4b4f-7ceb-380032bca801","sha1":"d66b8ba869ba484ea9fe3eed3d754dfc1e381350"}}
', "context_id" = '7c9b826b-8945-4325-be8e-cc5b26d05678' WHERE ("id" = 237)

You see messages similar to the following repeatedly in the ncp.stderr.log file:

Note: To get the NCP logs, use bosh to download a log bundle from the service instance deployment that matches the UUID of the failed cluster, for example "bosh -d service-instance_4956a7db-dc0a-44af-9b5c-98e20ab7b30f logs". Then unpack the bundle and review "master.*/ncp/ncp/stderr.log".

Traceback (most recent call last):
  File "/usr/local/bin/ncp", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/cmd/ncp.py", line 16, in main
    ncp_main.start_ncp(coe)
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/ncp/main.py", line 160, in start_ncp
    nsx_errors = common_utils.validate_nsx_config()
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/common/utils.py", line 796, in validate_nsx_config
    ipnetwork_errors = _validate_mgr_ip_network()
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/common/utils.py", line 675, in _validate_mgr_ip_network
    return _validate_ip_network(all_blocks, all_pools, external_ip_space_ids)
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/common/utils.py", line 733, in _validate_ip_network
    if is_overlapped(ip_block, ip_space):
  File "/usr/local/lib/python2.7/dist-packages/nsx_ujo/common/utils.py", line 757, in is_overlapped
    ip_network1 = ipaddress.ip_network(obj1['cidr'])
  File "/usr/lib/python2.7/dist-packages/ipaddress.py", line 186, in ip_network
    return IPv4Network(address, strict)
  File "/usr/lib/python2.7/dist-packages/ipaddress.py", line 1656, in __init__
    raise ValueError('%s has host bits set' % self)
ValueError: 172.26.0.1/16 has host bits set
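If the log bundle is large, a short script can pull the offending CIDR out of the NCP logs. The following is a minimal sketch, assuming Python 3 and an already unpacked bundle. The function names find_bad_cidrs and scan_bundle are hypothetical helpers, not part of bosh, PKS, or NCP, and the search simply looks for files named stderr.log under the directory you pass in.

```python
import re
from pathlib import Path

# Matches the final line of the traceback, e.g.
# "ValueError: 172.26.0.1/16 has host bits set"
HOST_BITS = re.compile(r"ValueError: (\S+) has host bits set")


def find_bad_cidrs(text):
    """Return the distinct CIDRs reported with host bits set, in order of first appearance."""
    seen = []
    for match in HOST_BITS.finditer(text):
        cidr = match.group(1)
        if cidr not in seen:
            seen.append(cidr)
    return seen


def scan_bundle(root):
    """Map each stderr.log under root (hypothetical bundle directory) to the CIDRs it reports."""
    results = {}
    for path in Path(root).rglob("stderr.log"):
        cidrs = find_bad_cidrs(path.read_text(errors="replace"))
        if cidrs:
            results[str(path)] = cidrs
    return results


# Example using the last line of the traceback, repeated as it is in the log.
sample = "ValueError: 172.26.0.1/16 has host bits set\n" * 3
print(find_bad_cidrs(sample))
```

To scan a real bundle, call scan_bundle with the directory where the logs were unpacked.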

Environment

VMware PKS 1.x

Cause

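The CIDR of an IP Block that PKS uses in NSX-T IPAM is specified with a host address (for example, 172.26.0.1/16) where a network address is expected (for example, 172.26.0.0/16). NCP raises a ValueError while validating the NSX configuration at startup, as shown in the traceback, so the CoreDNS pods never become available.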
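The failing check can be reproduced outside NCP. The following is a minimal sketch, assuming Python 3, whose standard ipaddress module provides the same ip_network function that appears in the traceback. The helper name check_block_cidr is hypothetical and is not part of NCP.

```python
import ipaddress


def check_block_cidr(cidr):
    """Return None if cidr is a network address, otherwise the validation error text."""
    try:
        # strict=True is the default: any host bits set in the address raise ValueError.
        ipaddress.ip_network(cidr)
    except ValueError as exc:
        return str(exc)
    return None


# CIDR from the traceback: rejected because 172.26.0.1 is a host address within the /16.
print(check_block_cidr("172.26.0.1/16"))
# Network address of the same subnet: accepted, prints None.
print(check_block_cidr("172.26.0.0/16"))
```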

Resolution

Update the IP Block CIDR to the network address of the desired subnet by doing the following:

1. Log in to the Ops Manager UI as admin.
2. Click the PKS tile, then click Networking in the left pane.
3. Copy the values of both Pods IP Block ID and Nodes IP Block ID.
4. Log in to the NSX-T UI as admin.
5. Navigate to the Advanced Networking & Security tab and click "IPAM" in the left tree, under the "Networking" section.
6. Click each IP Block and compare its ID with the IDs copied in Step 3.
7. For each IP Block whose ID matches, click the pencil icon at the top left of the menu to display the CIDR of the IP Block. Change the IP address to the subnet's network address. An online subnet calculator can be used to determine the network address of each subnet.
8. From the pks CLI, delete the failed cluster and re-create it.
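As an alternative to an online subnet calculator, the network address can be derived locally. The following is a minimal sketch, assuming Python 3. The helper name network_cidr is hypothetical, and 10.1.2.200/24 is only an illustrative second value, not one taken from this environment.

```python
import ipaddress


def network_cidr(cidr):
    """Return cidr rewritten with the subnet's network address."""
    # strict=False masks off the host bits instead of raising ValueError.
    return str(ipaddress.ip_network(cidr, strict=False))


for value in ("172.26.0.1/16", "10.1.2.200/24"):
    print(value, "->", network_cidr(value))
# 172.26.0.1/16 -> 172.26.0.0/16
# 10.1.2.200/24 -> 10.1.2.0/24
```

The value on the right is what to enter as the CIDR of the IP Block in NSX-T.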