tkgi upgrade-cluster CLUSTER_NAME
fails
tkgi cluster CLUSTER_NAME
reports
UUID: CLUSTER_UUID
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Failed for bosh task: TASK_ID, error-message: 0 succeeded, 1 errored, 0 canceled
OR
UUID: CLUSTER_UUID
Last Action: UPGRADE
Last Action State: failed
Last Action Description: Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: CLUSTER_UUID, broker-request-id: SOME_UUID, task-id: TASK_ID, operation: update, error-message: 0 succeeded, 1 errored, 0 canceled
AND
bosh task TASK_ID
shows that the apply-addons errand was the task that failed, with the error:
0 succeeded, 1 errored, 0 canceled
bosh task TASK_ID --debug
shows that the metrics-server rollout was the step that failed (0 of 1 updated replicas became available before the errand gave up):
"result_output" = '{"instance":{"group":"apply-addons","id":"<APPLY_ADDONS_ID>"},"errand_name":"apply-addons","exit_code":1,"stdout":"No need to change the CoreDNS replica because there are 30 linux worker nodes\nDeploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nserviceaccount/coredns unchanged\nclusterrole.rbac.authorization.k8s.io/system:coredns unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:coredns unchanged\nconfigmap/coredns unchanged\ndeployment.apps/coredns unchanged\nservice/kube-dns unchanged\ndeployment \"coredns\" successfully rolled out\nDeploying /var/vcap/jobs/apply-specs/specs/metrics-server/\nclusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged\nrolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged\napiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged\nserviceaccount/metrics-server unchanged\ndeployment.apps/metrics-server unchanged\nservice/metrics-server unchanged\nclusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged\nsecret/metrics-server-certs unchanged\nWaiting for deployment \"metrics-server\" rollout to finish: 0 of 1 updated replicas are available...\nfailed to start all system specs after 1200 with exit code 124\n"
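The stdout field in the debug output stores line breaks as literal \n sequences, which makes the failing step hard to spot. The following is a minimal sketch for making that text readable, assuming the result_output text is piped in on stdin. The function name expand_newlines is hypothetical and not part of the bosh CLI.

```shell
# Hypothetical helper: turn literal "\n" sequences (backslash + n) into
# real line breaks so each errand log line appears on its own line.
expand_newlines() {
  awk '{ gsub(/\\n/, "\n"); print }'
}

# Possible usage (assumes the debug output contains a result_output line):
#   bosh task TASK_ID --debug | grep result_output | expand_newlines | tail -n 3
```

With the output above, the last lines printed this way would be the "Waiting for deployment" message and the "failed to start all system specs" message.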
kubectl get pods -A -o wide | grep metrics-server
OR
bosh -d service-instance_CLUSTERUUID recreate worker/WORKERNODE_UUID --no-converge
OR
tkgi upgrade-cluster CLUSTER_NAME
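To find which worker node hosts the metrics-server pod, the kubectl command above can be narrowed to the pod name and node name. This is a sketch only: the function name metrics_server_nodes is hypothetical, and it assumes its input is in the four-column form produced by the custom-columns query shown in the comment.

```shell
# Hypothetical helper: read "NAMESPACE NAME PHASE NODE" rows on stdin and
# print the node name for every pod whose name starts with metrics-server.
metrics_server_nodes() {
  awk '$2 ~ /^metrics-server/ { print $4 }'
}

# Possible usage against a live cluster (requires kubectl access):
#   kubectl get pods -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,NODE:.spec.nodeName \
#     | metrics_server_nodes
```

The node name printed here is not necessarily the WORKERNODE_UUID that bosh expects. Running bosh -d service-instance_CLUSTERUUID vms lists the worker instances and their IP addresses, which may help match the node to its BOSH instance.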