Pinniped post deploy job stuck in Error status and fails with BackoffLimitExceeded

Article ID: 319311

Updated On: 08-23-2023

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:

You are trying to create a Tanzu Kubernetes Grid management cluster.
Cluster creation succeeds, but every pinniped-post-deploy-job-* pod in the pinniped-supervisor namespace is in Error status, as shown in the output below:
kubectl get pods -n pinniped-supervisor

NAMESPACE NAME READY STATUS
pinniped-supervisor pinniped-post-deploy-job-ver-1-264sn 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-4fvkj 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-88s9q 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-b6frc 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-h4vwd 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-ks8t2 0/1 Error
pinniped-supervisor pinniped-post-deploy-job-ver-1-mxfn8 0/1 Error
pinniped-supervisor pinniped-supervisor-56fbb8cffd-65k2g 1/1 Running

Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

To find the root cause of the problem, inspect the job and its events with the following commands:

kubectl get jobs -n pinniped-supervisor

NAME COMPLETIONS DURATION AGE
pinniped-post-deploy-job-ver-1 0/1 33m 33m

kubectl describe jobs pinniped-post-deploy-job-ver-1 -n pinniped-supervisor

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 30m job-controller Created pod: pinniped-post-deploy-job-ver-1-mxfn8
Normal SuccessfulCreate 19m job-controller Created pod: pinniped-post-deploy-job-ver-1-264sn
<-------TRUNCATED--------->
Warning BackoffLimitExceeded 2m38s job-controller Job has reached the specified backoff limit

The output above shows that the job has exceeded its back-off limit, which defaults to 6. The job-controller recreates failed pods that belong to a job with an exponential back-off delay (10s, 20s, 40s ...) capped at 6 minutes. The back-off count is reset when a job's pod is deleted or succeeds without any other pods for the job failing around that time. You can read more about Jobs and the pod back-off failure policy in the Kubernetes documentation.
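To make the retry schedule concrete, the following is a minimal sketch that prints the nominal delay before each retry under the rule described above (start at 10s, double each time, cap at 6 minutes). The function name backoff_delays is hypothetical and used only for illustration. The real timing in a cluster also includes pod scheduling and run time, so it will not match these numbers exactly.

```shell
# Print the nominal delay before each pod retry: 10s, doubling every time,
# capped at 6 minutes (360s). This is an illustration of the schedule only;
# it does not talk to a cluster.
backoff_delays() {
  bd_retries=${1:-6}   # number of retries to print; 6 is the default back-off limit
  bd_delay=10          # first delay, in seconds
  bd_cap=360           # 6-minute cap, in seconds
  bd_i=1
  while [ "$bd_i" -le "$bd_retries" ]; do
    echo "retry $bd_i: ${bd_delay}s"
    bd_delay=$((bd_delay * 2))
    if [ "$bd_delay" -gt "$bd_cap" ]; then
      bd_delay=$bd_cap
    fi
    bd_i=$((bd_i + 1))
  done
}

# Usage: backoff_delays      (schedule for the default limit of 6)
#        backoff_delays 8    (shows the cap taking effect)
```

With the default limit of 6, the first pod plus six retries accounts for the seven pods in Error status shown in the Symptoms section.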

This can happen when the pods in the pinniped-supervisor namespace were not in Ready status or took a long time to become healthy. Before applying the resolution, make sure the pods in the pinniped-supervisor and pinniped-concierge namespaces are in Ready status.
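The readiness check above can be scripted. The sketch below is one possible approach, not an official procedure: it reads the output of kubectl get pods --no-headers and prints every pod that is not fully Ready. The function name not_ready_pods is hypothetical. It assumes the default column order (NAME, READY, STATUS) and skips the pinniped-post-deploy-job-* pods, which are expected to be in Error status until the job is deleted.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the name of
# each pod whose READY column is not n/n. Pods of the post-deploy job and
# pods in Completed status are ignored (assumption: default column order).
not_ready_pods() {
  awk '{
    split($2, ready, "/")    # READY column, for example "0/1"
    if ($1 !~ /^pinniped-post-deploy-job/ &&
        $3 != "Completed" &&
        ready[1] != ready[2]) print $1
  }'
}

# Usage (requires access to the cluster):
#   for ns in pinniped-supervisor pinniped-concierge; do
#     kubectl get pods -n "$ns" --no-headers | not_ready_pods
#   done
```

No output means every remaining pod in both namespaces reports Ready.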

Resolution

You can fix this problem by deleting the pinniped-post-deploy-job job in the pinniped-supervisor namespace once you have resolved the issues with the pinniped deployment. Use the job name reported by kubectl get jobs -n pinniped-supervisor (pinniped-post-deploy-job-ver-1 in the example above). Follow the steps below to delete the job and monitor the pinniped app until it reconciles successfully.

kubectl get app -n tkg-system pinniped
NAME DESCRIPTION SINCE-DEPLOY AGE
pinniped Reconcile failed: Deploying: exit status 1 4m43s 49m

kubectl delete jobs.batch -n pinniped-supervisor pinniped-post-deploy-job-ver-1
job.batch "pinniped-post-deploy-job-ver-1" deleted
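Because the job listing earlier in this article shows the job name with a -ver-1 suffix, one option is to look the name up rather than type it. The following is a hedged sketch under that assumption. The function name delete_post_deploy_jobs is hypothetical, and the match on the pinniped-post-deploy-job prefix is an assumption about how the job is named in your environment.

```shell
# Delete every job in the pinniped-supervisor namespace whose name starts
# with "pinniped-post-deploy-job". `kubectl get ... -o name` prints entries
# such as "job.batch/pinniped-post-deploy-job-ver-1", which can be passed
# straight back to `kubectl delete`.
delete_post_deploy_jobs() {
  kubectl get jobs -n pinniped-supervisor -o name |
    grep '/pinniped-post-deploy-job' |
    while read -r job; do
      kubectl delete -n pinniped-supervisor "$job"
    done
}

# Usage (requires access to the cluster):
#   delete_post_deploy_jobs
```

Run it only after the pinniped pods are healthy, as described in the Cause section.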

The kapp-controller reconciles the app object every 5 minutes, so you may have to wait before the reconciliation starts. Once the app has started reconciling, you should see output similar to the following:

kubectl get app -n tkg-system pinniped
NAME DESCRIPTION SINCE-DEPLOY AGE
pinniped Reconciling 5s 49m

kubectl get app -n tkg-system pinniped
NAME DESCRIPTION SINCE-DEPLOY AGE
pinniped Reconcile succeeded 47s 50m

kubectl get jobs -n pinniped-supervisor
NAME COMPLETIONS DURATION AGE
pinniped-post-deploy-job-ver-1 1/1 9s 62s

kubectl get pods -n pinniped-supervisor
NAME READY STATUS RESTARTS AGE
pinniped-post-deploy-job-ver-1-fdk6n 0/1 Completed 0 70s
pinniped-supervisor-56fbb8cffd-6b8bh 1/1 Running 0 65s
pinniped-supervisor-56fbb8cffd-dkt6c 1/1 Running 0 65s
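Instead of rerunning kubectl get app by hand, the wait can be scripted. The sketch below polls the pinniped app until its description starts with "Reconcile succeeded". The function name wait_for_pinniped_app and its default timeout and interval are hypothetical. It also assumes that the DESCRIPTION column is backed by the status.friendlyDescription field of the App resource; verify this against your cluster before relying on it.

```shell
# Poll the pinniped app until it reports "Reconcile succeeded".
# Arguments (both optional, both hypothetical defaults):
#   $1  overall timeout in seconds (default 600, two 5-minute cycles)
#   $2  seconds between polls      (default 10)
# Returns 0 on success and 1 if the timeout expires first.
wait_for_pinniped_app() {
  wp_timeout=${1:-600}
  wp_interval=${2:-10}
  wp_elapsed=0
  while [ "$wp_elapsed" -le "$wp_timeout" ]; do
    # Assumption: DESCRIPTION is read from .status.friendlyDescription
    wp_desc=$(kubectl get app -n tkg-system pinniped \
      -o jsonpath='{.status.friendlyDescription}')
    echo "pinniped: $wp_desc"
    case $wp_desc in
      "Reconcile succeeded"*) return 0 ;;
    esac
    sleep "$wp_interval"
    wp_elapsed=$((wp_elapsed + wp_interval))
  done
  return 1
}

# Usage (requires access to the cluster):
#   wait_for_pinniped_app && kubectl get jobs -n pinniped-supervisor
```

If the function returns 1, check the app description and the pinniped pods again before deleting the job a second time.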
