PKS create cluster failed with unable to recognize coredns spec while running apply addon errand
Article ID: 298602

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

When using PKS 1.5.1-build.8, the apply-addons errand starts failing during cluster upgrades.
The error details are below.


# pks create cluster failed

[root@ovrjbxlandc1prd ~]# pks clusters

PKS Version    Name   k8s Version  Plan Name  UUID                                  Status  Action
1.5.1-build.8  test1  1.14.6       small      a22b58ed-0223-4006-8066-61a51d41bef0  failed  UPGRADE

# pks create cluster failed details

[root@ovrjbxlandc1prd ~]# pks cluster test1

PKS Version: 1.5.1-build.8
Name: test1
K8s Version: 1.14.6
Plan Name: small
UUID: a22b58ed-0223-4006-8066-61a51d41bef0
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Failed for bosh task: 115957, error-message: 0 succeeded, 1 errored, 0 canceled
Kubernetes Master Host: test1.local
Kubernetes Master Port: 8443
Worker Nodes: 3
Kubernetes Master IP(s): 10.52.12.25
Network Profile Name: 


# failed task 115957 debug

bosh task 115957 --debug

{"time":1571729153,"stage":"Running errand","tags":[],"total":1,"task":"apply-addons/799f5517-6f4c-445a-a33d-28beaa73d439 (0)","index":1,"state":"finished","progress":100}
{"time":1571729153,"stage":"Fetching logs for apply-addons/799f5517-6f4c-445a-a33d-28beaa73d439 (0)","tags":[],"total":1,"task":"Finding and packing log files","index":1,"state":"started","progress":0}
{"time":1571729154,"stage":"Fetching logs for apply-addons/799f5517-6f4c-445a-a33d-28beaa73d439 (0)","tags":[],"total":1,"task":"Finding and packing log files","index":1,"state":"finished","progress":100}
', "result_output" = '{"instance":{"group":"apply-addons","id":"799f5517-6f4c-445a-a33d-28beaa73d439"},"errand_name":"apply-addons","exit_code":1,"stdout":"Deploying /var/vcap/jobs/apply-specs/specs/coredns.yml\n
failed to start all system specs after 1200 with exit code 1\n","stderr":"unable to recognize \"/var/vcap/jobs/apply-specs/specs/coredns.yml\": 
Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53: no such host\nunable to recognize \
"/var/vcap/jobs/apply-specs/specs/coredns.yml\": Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53: 
no such host\nunable to recognize \"/var/vcap/jobs/apply-specs/specs/coredns.yml\": Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53:
no such host\nunable to recognize \"/var/vcap/jobs/apply-specs/specs/coredns.yml\": Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53:
no such host\nunable to recognize \"/var/vcap/jobs/apply-specs/specs/coredns.yml\": Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53:
no such host\nunable to recognize \"/var/vcap/jobs/apply-specs/specs/coredns.yml\": Get https://master.k8s.internal:8443/api?timeout=32s: dial tcp: lookup master.k8s.internal on 169.254.0.2:53:
no such host\n","logs":{"blobstore_id":"3f640dd6-1b4f-4c74-6da8-33289cfeaac0","sha1":"6fcbae029454fe59705422c781011f689f88f054"}}
', "context_id" = '8a8c3158-2399-4d9f-93fe-5352941eb2fe' WHERE ("id" = 115957)
D, [2019-10-22T07:26:09.986885 #6032] [task:115957] DEBUG -- DirectorJobRunner: (0.000588s) (conn: 47322485367800) COMMIT
I, [2019-10-22T07:26:09.987040 #6032] [] INFO -- DirectorJobRunner: Task took 1 minute 57.806051812999996 seconds to process.


Environment

Product Version: 1.5

Resolution

A record without a "version" or "created_at" can occasionally be created in the LocalDnsBlob table on the BOSH Director. If this record is interacted with, the blobstore operation fails.

When this happens, the ".latest" call for that model always returns the record with the nil version, which prevents the sync DNS scheduler from working properly.

The sync_dns job crashes whenever BOSH creates a new VM. In sync_dns.stdout.log, the following output is seen:
ERROR -- Director: Shutting down bosh-director-sync-dns: Thread terminated 
When tailing the log, this message appears several times within a minute or so of the errand VM being created.

monit summary also shows the process cycling between "running" and "not monitored" several times. Run monit summary on the BOSH Director VM and check the status of each process in the output.
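For reference, monit summary output on the Director might look like the following. This is illustrative only; the exact process names and Monit version depend on your deployment, and the symptom to look for is a process flapping into the "not monitored" state:

```
The Monit daemon 5.2.5 uptime: 3d 2h 15m

Process 'nats'                   running
Process 'director'               running
Process 'sync_dns'               not monitored
```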

In this case, the console is located at /var/vcap/jobs/director/bin/console. Launch it and inspect the latest LocalDnsBlob record:
Bosh::Director::Models::LocalDnsBlob.latest
<Bosh::Director::Models::LocalDnsBlob @values={:id=>5418, :blob_id=>nil, :version=>nil, :created_at=>nil, :records_version=>0, :aliases_version=>0}>
Compare this output against the most recently inserted record:
Bosh::Director::Models::LocalDnsBlob.last
<Bosh::Director::Models::LocalDnsBlob @values={:id=>6042, :blob_id=>6041, :version=>6042, :created_at=>2019-10-22 20:22:12 UTC, :records_version=>6110, :aliases_version=>0}>

Bosh::Director::Models::LocalDnsBlob.latest
<Bosh::Director::Models::LocalDnsBlob @values={:id=>5418, :blob_id=>nil, :version=>nil, :created_at=>nil, :records_version=>0, :aliases_version=>0}>
To fix the problem, update the broken record so that its version field matches its id:
Bosh::Director::Models::LocalDnsBlob.find(id: 5418).update(version: 5418)
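The model fix above requires the live Director console, but the underlying ordering problem can be sketched in plain Ruby. This is a toy illustration only; Record, latest, and the nil-sorts-first ordering are assumptions for the sketch, not the Director's actual Sequel model:

```ruby
# Toy stand-in for a LocalDnsBlob row: just an id and a version.
Record = Struct.new(:id, :version)

# Assumption: "latest" picks the highest version, and a nil version
# is treated as greater than any number -- so the broken row always wins.
def latest(records)
  records.max_by { |r| r.version.nil? ? Float::INFINITY : r.version }
end

blobs = [Record.new(5418, nil), Record.new(6042, 6042)]
latest(blobs).id   # => 5418, the broken nil-version record

# The fix from this article: backfill version to match the id.
blobs.first.version = 5418
latest(blobs).id   # => 6042, ordering is sane again
```

Once the record is backfilled, re-running Bosh::Director::Models::LocalDnsBlob.latest in the console should return the same record as .last, and the sync_dns job should stop crashing.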