Error: "0 succeeded, 1 errored, 0 canceled" for apply-addon errand during upgrade-cluster


Article ID: 375136


Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition 1.x
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

  • tkgi upgrade-cluster CLUSTER_NAME fails

 

  • tkgi cluster CLUSTER_NAME reports

UUID:                     CLUSTER_UUID
Last Action:              UPGRADE
Last Action State:        failed
Last Action Description:  Failed for bosh task: TASK_ID, error-message: 0 succeeded, 1 errored, 0 canceled

 

OR

 

UUID:                     CLUSTER_UUID
Last Action:              UPGRADE
Last Action State:        failed
Last Action Description: Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: CLUSTER_UUID, broker-request-id: SOME_UUID, task-id: TASK_ID, operation: update, error-message: 0 succeeded, 1 errored, 0 canceled

 

AND

  • Output of: bosh task TASK_ID 

Shows that the apply-addons errand task failed with the following error:

0 succeeded, 1 errored, 0 canceled

 

  • Output of: bosh task TASK_ID --debug shows that the metrics-server was the issue:

"result_output" = '{"instance":{"group":"apply-addons","id":"<APPLY_ADDONS_ID>"},"errand_name":"apply-addons","exit_code":1,"stdout":"No need to change the CoreDNS replica because there are 30 linux worker nodes\nDeploying /var/vcap/jobs/apply-specs/specs/coredns.yml\nserviceaccount/coredns unchanged\nclusterrole.rbac.authorization.k8s.io/system:coredns unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:coredns unchanged\nconfigmap/coredns unchanged\ndeployment.apps/coredns unchanged\nservice/kube-dns unchanged\ndeployment \"coredns\" successfully rolled out\nDeploying /var/vcap/jobs/apply-specs/specs/metrics-server/\nclusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator unchanged\nrolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader unchanged\napiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io unchanged\nserviceaccount/metrics-server unchanged\ndeployment.apps/metrics-server unchanged\nservice/metrics-server unchanged\nclusterrole.rbac.authorization.k8s.io/system:metrics-server unchanged\nclusterrolebinding.rbac.authorization.k8s.io/system:metrics-server unchanged\nsecret/metrics-server-certs unchanged\nWaiting for deployment \"metrics-server\" rollout to finish: 0 of 1 updated replicas are available...\nfailed to start all system specs after 1200 with exit code 124\n"

 

 

Environment

  • TKGI 1.18.3

Cause

  • There is a problem with the metrics-server pod.

 

  • The metrics-server pod must be running on the cluster before the cluster upgrade begins.

 

  • Checking the currently running metrics-server pod confirms it is failing:

kubectl get pods -A -o wide | grep metrics-server
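To see why the pod is failing (for example, an image pull error), describe the pod and check its logs, substituting the namespace and pod name returned by the previous command:

kubectl -n NAMESPACE describe pod METRICS_SERVER_POD_NAME
kubectl -n NAMESPACE logs METRICS_SERVER_POD_NAME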

 

  • The metrics-server pod was in a failed state because it was unable to pull the metrics-server container image.

 

  • The metrics-server container image is shipped by default with the TKGI stemcell.

 

  • The cause in this scenario was a worker node that ran out of resources, such as disk storage. As a result, container images can be purged or lost.
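To check whether the affected worker node is under resource pressure, review the node conditions (NODE_NAME is the node reported by the kubectl get pods -o wide output above). The second command is an optional additional check over BOSH SSH and assumes the standard /var/vcap/data ephemeral disk mount:

kubectl describe node NODE_NAME | grep -A 8 Conditions
bosh -d service-instance_CLUSTER_UUID ssh worker/WORKER_NODE_UUID -c 'df -h /var/vcap/data'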

 

  • A failing metrics-server pod can be caused by an infrastructure or cluster resource issue.

Resolution

  • Resolve the failing metrics-server pod issue. 

OR

  • Rebuild the worker node from the BOSH stemcell to restore the default container images:

bosh -d service-instance_CLUSTER_UUID recreate worker/WORKER_NODE_UUID --no-converge
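After the node is recreated, confirm that the metrics-server deployment rolls out successfully (this assumes the TKGI default of deploying metrics-server into the kube-system namespace):

kubectl -n kube-system rollout status deployment/metrics-server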

 

OR

  • If the issue is only missing images, you can reload the system images manually by following the steps in KB 298569.
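Optionally, before re-running the full upgrade, the apply-addons errand can be run on its own to confirm that the add-ons now deploy cleanly (the deployment name follows the service-instance_CLUSTER_UUID pattern shown above):

bosh -d service-instance_CLUSTER_UUID run-errand apply-addons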

 

  • Then upgrade the cluster with:

tkgi upgrade-cluster CLUSTER_NAME
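Once the upgrade completes, verify the cluster status and confirm that Last Action State reports succeeded:

tkgi cluster CLUSTER_NAME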