Aria Automation pods fail to initialize, stuck at 0/1 Init:0/2

Article ID: 396799

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Error Code: LCMVRAVACONFIG590003
    Cluster initialization failed on VMware Aria Automation.
  • The deploy.sh script hangs with many pods stuck at 0/1 Init:0/2 and does not progress.
  • When listing the pods with the command kubectl get pods -n prelude, several pods are stuck at this stage:
    NAME                          READY   STATUS     RESTARTS   AGE
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
  • When viewing the logs of the affected pods, many API calls to other pods return 503 Service Unavailable or 'wait timed out' (see the example below for inspecting a stuck pod).
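To see which dependency a stuck pod is waiting on, inspect its init containers and their logs. A minimal example, where approval-service-app-pod-id stands in for an actual pod name from the output above, and the init container name is taken from the Init Containers section of the describe output:

    # Show pod events and init container status (substitute a real pod name).
    kubectl describe pod approval-service-app-pod-id -n prelude

    # View the logs of a specific init container listed by the command above.
    kubectl logs approval-service-app-pod-id -n prelude -c <init-container-name>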

Environment

Aria Automation 8.x

Cause

  • A coredns pod restarted, so the other pods could not resolve the addresses of their dependent pods and stalled during initialization.
  • The coredns pods are responsible for resolving internal pod addresses; other pods rely on them to locate their dependent pods and check their status (see the check below).
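To confirm the restart, check the RESTARTS column for the coredns pods and, if needed, review the logs of the previous container instance. A minimal check, where coredns-xxxxx is a placeholder for an actual pod name from the first command's output:

    # A non-zero RESTARTS value for coredns indicates the condition described above.
    kubectl get pods -n kube-system | grep coredns

    # Review the logs of the previous (restarted) coredns instance.
    kubectl logs -n kube-system coredns-xxxxx --previous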

Resolution

  • SSH into the virtual appliance in the cluster experiencing these symptoms (for example, with PuTTY).
  • Run the following commands:
    1. Shut down the services:
      • /opt/scripts/deploy.sh --shutdown
    2. Remove the cleanup marker file. Perform this on each node experiencing the symptoms described:
      • rm -f /var/vmware/prelude/docker/last-cleanup
    3. Reboot the appliance.
    4. Run the first-boot checks:
      • vracli status first-boot -w 1800
    5. Wait, or repeat the previous command, until it reports the following:
      • First boot complete
    6. Run the command "kubectl get node" to confirm that all nodes are in the Ready state.
    7. Run the command "docker system prune -af" on all Aria Automation nodes, one node at a time.
    8. Run the command "/opt/scripts/restore_docker_images.sh" on all Aria Automation nodes, one node at a time.
    9. Run the command "/opt/scripts/deploy.sh" on any one Aria Automation node.
    10. The deploy.sh script may again become stuck at the same status.
    11. Validate whether coredns has restarted: kubectl get pods -n kube-system
    12. Delete all three coredns pods, one after the other, with the command: kubectl delete pod -n kube-system coredns-xxxxx (a scripted version appears after this list).
    13. Validate that the pods now come up as expected.
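For step 12, the coredns pods can be recycled sequentially so that in-cluster DNS is never fully unavailable. A minimal sketch, assuming the coredns pods carry the standard upstream label k8s-app=kube-dns (verify with kubectl get pods -n kube-system --show-labels before running):

    # Delete each coredns pod in turn, waiting for a Ready replacement
    # before moving on to the next one.
    for pod in $(kubectl get pods -n kube-system -l k8s-app=kube-dns -o name); do
        kubectl delete -n kube-system "$pod"
        kubectl wait --for=condition=Ready pods -l k8s-app=kube-dns \
            -n kube-system --timeout=300s
    done

Once the coredns pods are recreated, re-run kubectl get pods -n prelude and confirm the previously stuck pods progress past Init:0/2.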