Pinniped post deploy job stuck in Error status and fails with BackoffLimitExceeded

Article ID: 319311

Updated On: 08-23-2023

Products

VMware Tanzu Kubernetes Grid

Issue/Introduction

Symptoms:
  • You are trying to create a Tanzu Kubernetes Grid management cluster
  • Cluster creation succeeds, but every pinniped-post-deploy-job-* pod in the pinniped-supervisor namespace is in Error status, as shown in the output below:

    kubectl get pods -n pinniped-supervisor

    NAMESPACE            NAME                                   READY   STATUS 
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-264sn   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-4fvkj   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-88s9q   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-b6frc   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-h4vwd   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-ks8t2   0/1     Error  
    pinniped-supervisor  pinniped-post-deploy-job-ver-1-mxfn8   0/1     Error  
    pinniped-supervisor  pinniped-supervisor-56fbb8cffd-65k2g   1/1     Running



Environment

VMware Tanzu Kubernetes Grid 1.x

Cause

To find the root cause of the problem, inspect the job with the following commands:

kubectl get jobs -n pinniped-supervisor

NAME                             COMPLETIONS   DURATION   AGE
pinniped-post-deploy-job-ver-1   0/1           33m        33m

kubectl describe jobs pinniped-post-deploy-job-ver-1 -n pinniped-supervisor

Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      30m    job-controller  Created pod: pinniped-post-deploy-job-ver-1-mxfn8
  Normal   SuccessfulCreate      19m    job-controller  Created pod: pinniped-post-deploy-job-ver-1-264sn
<-------TRUNCATED--------->
  Warning  BackoffLimitExceeded  2m38s  job-controller  Job has reached the specified backoff limit
 

The output above shows that the job has exceeded its back-off limit, which defaults to 6 (the job's .spec.backoffLimit field). The job-controller recreates failed pods that belong to a job with an exponential back-off delay (10s, 20s, 40s ...) capped at 6 minutes. The back-off count is reset when one of the job's pods is deleted, or when a pod succeeds without any other pod of the job failing around that time. The Kubernetes documentation on Jobs describes the pod back-off failure policy in more detail.
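
As a rough illustration of the schedule described above, the following sketch prints the delay before each retry. It assumes the delay starts at 10 seconds, doubles after every failure, and is capped at 360 seconds (6 minutes). The function name backoff_delay is hypothetical, and the real job-controller timing may differ in detail between Kubernetes versions.

```shell
# Illustrative only: approximates the back-off schedule described above
# (10s, 20s, 40s ... capped at 6 minutes). This is not the controller's code.

# backoff_delay N prints the assumed delay in seconds before retry N (1-based).
backoff_delay() {
  bd_retry=$1
  bd_delay=10
  bd_i=1
  while [ "$bd_i" -lt "$bd_retry" ]; do
    bd_delay=$((bd_delay * 2))
    if [ "$bd_delay" -ge 360 ]; then
      bd_delay=360   # cap at 6 minutes
      break
    fi
    bd_i=$((bd_i + 1))
  done
  echo "$bd_delay"
}

# Print the assumed delays for the default back-off limit of 6 retries.
for retry in 1 2 3 4 5 6; do
  echo "retry ${retry}: $(backoff_delay "$retry")s"
done
```

The limit configured on the job itself can be read with kubectl get job pinniped-post-deploy-job-ver-1 -n pinniped-supervisor -o jsonpath='{.spec.backoffLimit}'.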

This can happen when the pods in the pinniped-supervisor namespace were not in Ready status or took a long time to become healthy. Before applying the resolution, make sure the pods in both the pinniped-supervisor and pinniped-concierge namespaces are in Ready status.
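
One way to run this readiness check is to filter the kubectl get pods output for pods whose READY count is incomplete. The sketch below is a hypothetical helper (not_ready_pods is not part of any Tanzu tooling). It assumes the default kubectl get pods table on standard input and skips the post-deploy job pods, which never report Ready.

```shell
# Hypothetical helper: reads `kubectl get pods -n <namespace>` output on stdin
# and prints the name of every pod whose READY column is not n/n.
# Assumes the default column layout (NAME READY STATUS RESTARTS AGE).
not_ready_pods() {
  awk '
    NR == 1 { next }                          # skip the header row
    $1 ~ /^pinniped-post-deploy-job/ { next } # ignore the job pods
    {
      split($2, ready, "/")
      if (ready[1] != ready[2]) print $1
    }
  '
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get pods -n pinniped-supervisor | not_ready_pods
#   kubectl get pods -n pinniped-concierge  | not_ready_pods
# No output means every remaining pod in the namespace is Ready.
```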

Resolution

To fix this problem, first resolve any issues with the pinniped deployment, then delete the pinniped-post-deploy-job job in the pinniped-supervisor namespace. Use the exact job name reported by kubectl get jobs -n pinniped-supervisor, which may include a version suffix such as pinniped-post-deploy-job-ver-1. The steps below delete the job and monitor the pinniped app until it reconciles successfully.
 

kubectl get app -n tkg-system pinniped
NAME       DESCRIPTION                                  SINCE-DEPLOY   AGE
pinniped   Reconcile failed: Deploying: exit status 1   4m43s          49m

kubectl delete jobs.batch -n pinniped-supervisor pinniped-post-deploy-job-ver-1
job.batch "pinniped-post-deploy-job-ver-1" deleted
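
Because the job name can carry a version suffix (the earlier kubectl get jobs output shows pinniped-post-deploy-job-ver-1), it may be safer to look the name up than to type it. The following is a minimal sketch, assuming the job name always starts with pinniped-post-deploy-job. The helper name post_deploy_jobs is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get jobs -n pinniped-supervisor -o name`
# output on stdin and keeps only the post-deploy job entries.
# Assumes entries look like job.batch/<job-name>.
post_deploy_jobs() {
  grep -E '^job\.batch/pinniped-post-deploy-job' || true
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get jobs -n pinniped-supervisor -o name | post_deploy_jobs |
#     while read -r job; do
#       kubectl delete -n pinniped-supervisor "$job"
#     done
```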


The kapp-controller reconciles the app object every 5 minutes, so you may have to wait some time before reconciliation starts. Once the app has started reconciling, you should see output similar to the following:

kubectl get app -n tkg-system pinniped
NAME       DESCRIPTION   SINCE-DEPLOY   AGE
pinniped   Reconciling   5s             49m

kubectl get app -n tkg-system pinniped
NAME       DESCRIPTION           SINCE-DEPLOY   AGE
pinniped   Reconcile succeeded   47s            50m

kubectl get jobs -n pinniped-supervisor
NAME                             COMPLETIONS   DURATION   AGE
pinniped-post-deploy-job-ver-1   1/1           9s         62s

kubectl get pods -n pinniped-supervisor
NAME                                   READY   STATUS      RESTARTS   AGE
pinniped-post-deploy-job-ver-1-fdk6n   0/1     Completed   0          70s
pinniped-supervisor-56fbb8cffd-6b8bh   1/1     Running     0          65s
pinniped-supervisor-56fbb8cffd-dkt6c   1/1     Running     0          65s
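
Rather than re-running kubectl get app by hand, the wait for reconciliation can be scripted. The sketch below polls until the description starts with "Reconcile succeeded". It assumes the DESCRIPTION column is backed by the App resource's .status.friendlyDescription field, and the function names get_app_description and wait_for_reconcile are hypothetical.

```shell
# Prints the DESCRIPTION text of the pinniped app.
# Assumption: the text is stored in .status.friendlyDescription.
get_app_description() {
  kubectl get app pinniped -n tkg-system \
    -o jsonpath='{.status.friendlyDescription}'
}

# wait_for_reconcile ATTEMPTS INTERVAL_SECONDS
# Polls until the description starts with "Reconcile succeeded".
# Returns 0 on success, 1 if all attempts are used up.
wait_for_reconcile() {
  wr_attempts=$1
  wr_interval=$2
  wr_n=0
  while [ "$wr_n" -lt "$wr_attempts" ]; do
    wr_desc=$(get_app_description 2>/dev/null) || wr_desc=""
    echo "pinniped app: ${wr_desc:-unknown}"
    case "$wr_desc" in
      "Reconcile succeeded"*) return 0 ;;
    esac
    wr_n=$((wr_n + 1))
    sleep "$wr_interval"
  done
  return 1
}

# Usage against a live cluster: poll every 15 seconds for up to 10 minutes.
#   wait_for_reconcile 40 15
```

The attempt count and interval are arbitrary. Anything longer than the 5-minute reconciliation period should give the kapp-controller at least one chance to retry.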