Error: "0 succeeded, 1 errored, 0 canceled" for apply-addon errand during upgrade-cluster


Article ID: 375136


Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

  • tkgi upgrade-cluster CLUSTER_NAME fails

 

  • tkgi cluster CLUSTER_NAME reports

UUID:                     CLUSTER_UUID
Last Action:              UPGRADE
Last Action State:        failed
Last Action Description:  Failed for bosh task: TASK_ID, error-message: 0 succeeded, 1 errored, 0 canceled

 

OR

 

UUID:                     CLUSTER_UUID
Last Action:              UPGRADE
Last Action State:        failed
Last Action Description: Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: CLUSTER_UUID, broker-request-id: SOME_UUID, task-id: TASK_ID, operation: update, error-message: 0 succeeded, 1 errored, 0 canceled

 

AND

  • Output of: bosh task TASK_ID 

Shows that the apply-addons errand task failed with the following error:

0 succeeded, 1 errored, 0 canceled

 

  • Output of: bosh task TASK_ID --debug shows that the metrics-server was the issue:

"result_output" = '{"instance":{"group":"apply-addons","id":"<APPLY_ADDONS_ID>"},"errand_name":"apply-addons","exit_code":1,"stdout":"No need to change the CoreDNS replica because there are 30 linux worker nodes\nDeploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nserviceaccount/coredns unchanged\nclusterrole.rbac.authorization.k8s.io/system:coredns unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:coredns unchanged\nconfigmap/coredns unchanged\ndeployment.apps/coredns unchanged\nservice/kube-dns unchanged\ndeployment \"coredns\" successfully rolled out\nDeploying /var/vcap/jobs/apply-specs/specs/metrics-server/\nclusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged\nrolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged\napiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged\nserviceaccount/metrics-server unchanged\ndeployment.apps/metrics-server unchanged\nservice/metrics-server unchanged\nclusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged\nsecret/metrics-server-certs unchanged\nWaiting for deployment \"metrics-server\" rollout to finish: 0 of 1 updated replicas are available...\nfailed to start all system specs after 1200 with exit code 124\n"

 

 

Environment

  • TKGI 1.18.3

Cause

  • There is a problem with the metrics-server pod.

 

  • The metrics-server pod must be running on the cluster before the cluster upgrade begins.

 

  • Checking the currently running metrics-server pod confirms it is failing:

kubectl get pods -A -o wide | grep metrics-server
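To see why the pod is failing (for example, an image pull error), describe the pod and check its logs, substituting the namespace and pod name returned by the previous command:

kubectl -n NAMESPACE describe pod METRICS_SERVER_POD_NAME
kubectl -n NAMESPACE logs METRICS_SERVER_POD_NAME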

 

  • The metrics-server pod was in a failed state because it was unable to pull the metrics-server container image.

 

  • The metrics-server container image is shipped by default with the TKGI stemcell.

 

  • The cause in this scenario was a worker node that ran out of resources, such as disk storage. As a result, container images can be purged or lost.
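To check whether the affected worker node is under resource pressure, review the node conditions (NODE_NAME is the node reported by the kubectl get pods -o wide output above). The second command is an optional additional check over BOSH SSH and assumes the standard /var/vcap/data ephemeral disk mount:

kubectl describe node NODE_NAME | grep -A 8 Conditions
bosh -d service-instance_CLUSTER_UUID ssh worker/WORKER_NODE_UUID -c 'df -h /var/vcap/data'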

 

  • A failing metrics-server pod can be caused by an infrastructure or cluster resource issue.

Resolution

  • Resolve the failing metrics-server pod issue. 

OR

  • Rebuild the worker node from the BOSH stemcell to restore the default container images:

bosh -d service-instance_CLUSTER_UUID recreate worker/WORKER_NODE_UUID --no-converge
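After the node is recreated, confirm that the metrics-server deployment rolls out successfully (this assumes the TKGI default of deploying metrics-server into the kube-system namespace):

kubectl -n kube-system rollout status deployment/metrics-server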

 

OR

  • If the issue is only missing images, you can reload the system images manually by following the steps in KB 298569.
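Optionally, before re-running the full upgrade, the apply-addons errand can be run on its own to confirm that the add-ons now deploy cleanly (the deployment name follows the service-instance_CLUSTER_UUID pattern shown above):

bosh -d service-instance_CLUSTER_UUID run-errand apply-addons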

 

  • Then upgrade the cluster with:

tkgi upgrade-cluster CLUSTER_NAME
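Once the upgrade completes, verify the cluster status and confirm that Last Action State reports succeeded:

tkgi cluster CLUSTER_NAME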