Aria Automation startup script deploy.sh fails due to many kube-system pods which cannot be deleted

Article ID: 418045


Products

VCF Automation

Issue/Introduction

Possible symptoms:

  • The deploy.sh script fails with the following error:
    • Found 3 nodes but only 2 noop pods, will retry.
  • One node is not running some of its Automation/Orchestrator services, as shown in the output of this command:
    • kubectl get pods -n prelude -o wide
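
To identify the underpopulated node, the prelude pods can be counted per node. The following one-liner is a sketch; the node names and pod counts in the sample output are placeholders and will differ in each environment:

  # Count prelude pods per node
  kubectl get pods -n prelude -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c

Sample output:

   40 vra-node-1.example.com
   40 vra-node-2.example.com
   12 vra-node-3.example.com

Here vra-node-3.example.com is running far fewer services than its peers, marking it as the affected node.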

This article deals with the scenario where many pods are stuck in a non-running state, which is defined by the following:

  • The output of the following command shows many pods from the same image, for example "lcc-update-registry-credentials", stuck in a non-running status such as Terminating:
    • kubectl get pods -n kube-system -o wide
    • It is assumed these pods are on one of the 3 nodes, which can be confirmed from the NODE column of the above output.
  • It is also assumed that the pods do not need to exit gracefully. Draining the node with kubectl after cordoning it can be attempted, but many of the pods may remain in Terminating status because draining uses eviction, which waits for pods to shut down gracefully.
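
For illustration, this scenario might look like the following abbreviated excerpt (pod name suffixes, IPs, and the node name are placeholders, and trailing columns are truncated for readability):

  kubectl get pods -n kube-system -o wide | grep Terminating

  lcc-update-registry-credentials-4kx9z   0/1   Terminating   0   3d   10.244.1.12   vra-node-3.example.com
  lcc-update-registry-credentials-7pq2m   0/1   Terminating   0   3d   10.244.1.15   vra-node-3.example.com
  lcc-update-registry-credentials-9rw8c   0/1   Terminating   0   3d   10.244.1.18   vra-node-3.example.com
  (and many more pods from the same image, all on the same node)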

Environment

VMware Aria Automation 8.x

Resolution

It is possible to forcibly delete these pods as follows (a consolidated example session is shown after these steps):

  1. Cordon off the node where the stuck pods reside to prevent new pods from being scheduled on it:
    • kubectl cordon <vRA-node-with-stuck-pods>
  2. Identify a label which will allow us to collectively select the stuck pods for deletion:
    • kubectl -n kube-system get pods --show-labels
  3. Delete the stuck pods by label. In our example, the pods share the label app=lcc-update-registry-credentials, which is listed in the LABELS column:
    • kubectl -n kube-system delete pods -l app=lcc-update-registry-credentials
  4. The above command may hang after deleting the pods; if it hangs for more than a minute, we may interrupt it with Ctrl+C.
  5. If any stuck pods with this label remain, we may restart kubelet on the affected node:
    1. kubectl -n kube-system get pods -l app=lcc-update-registry-credentials
    2. If the above output still shows any of these pods in Terminating status, run the following on the node hosting them:
      • systemctl restart kubelet
    3. Confirm the pods status once more:
      • kubectl -n kube-system get pods -l app=lcc-update-registry-credentials
      • kubectl -n kube-system get pods
  6. If the problem is resolved and the pods return to a healthy state, the node can be uncordoned to allow pod scheduling again:
    • kubectl uncordon <vRA-node-with-stuck-pods>
  7. At this point, the safest way to ensure a healthy cluster state is to run a full system deploy, which deletes and recreates the services (non-destructive, but takes roughly 30 minutes):
    • /opt/scripts/deploy.sh
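
For reference, a consolidated example session for the procedure above might look as follows. The node name vra-node-3.example.com carries over from the earlier illustrative output; substitute the affected node and label from the actual environment:

  # Step 1: stop new pods from being scheduled on the affected node
  kubectl cordon vra-node-3.example.com

  # Steps 2-3: find the shared label, then delete the stuck pods by that label
  kubectl -n kube-system get pods --show-labels
  kubectl -n kube-system delete pods -l app=lcc-update-registry-credentials

  # Step 5: if pods are still Terminating, restart kubelet on the affected node (run via SSH on that node), then re-check
  systemctl restart kubelet
  kubectl -n kube-system get pods -l app=lcc-update-registry-credentials

  # Steps 6-7: uncordon the node and run a full system deploy
  kubectl uncordon vra-node-3.example.com
  /opt/scripts/deploy.sh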