Aria Automation pods fail to initialize, stuck at 0/1 Init:0/2

Article ID: 396799

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Error Code: LCMVRAVACONFIG590003
    Cluster initialization failed on VMware Aria Automation.
  • The deploy.sh script hangs with many pods stuck at 0/1 Init:0/2 and does not progress.
  • When listing the pods with the command kubectl get pods -n prelude, several pods are stuck at this stage:
    NAME                          READY   STATUS     RESTARTS   AGE
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    approval-service-app-pod-id   0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    catalog-service-app-pod-id    0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-gateway-app-pod-id        0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
    ccs-infra-eas-app-pod-id      0/1     Init:0/2   0          15m
  • When viewing the logs of the affected pods, many API calls to other pods return 503 Service Unavailable or 'wait timed out' (see the example below for inspecting a stuck pod).
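To see which dependency a stuck pod is waiting on, inspect its init containers and their logs. A minimal example, where approval-service-app-pod-id stands in for an actual pod name from the output above, and the init container name is taken from the Init Containers section of the describe output:

    # Show pod events and init container status (substitute a real pod name).
    kubectl describe pod approval-service-app-pod-id -n prelude

    # View the logs of a specific init container listed by the command above.
    kubectl logs approval-service-app-pod-id -n prelude -c <init-container-name>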

Environment

Aria Automation 8.x

Cause

  • A coredns pod restarted, so the other pods could not resolve the addresses of their dependent pods and stalled during initialization.
  • The coredns pods are responsible for resolving internal pod addresses; other pods rely on them to locate their dependent pods and check their status (see the check below).
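To confirm the restart, check the RESTARTS column for the coredns pods and, if needed, review the logs of the previous container instance. A minimal check, where coredns-xxxxx is a placeholder for an actual pod name from the first command's output:

    # A non-zero RESTARTS value for coredns indicates the condition described above.
    kubectl get pods -n kube-system | grep coredns

    # Review the logs of the previous (restarted) coredns instance.
    kubectl logs -n kube-system coredns-xxxxx --previous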

Resolution

  • SSH into the virtual appliance in the cluster experiencing these symptoms (for example, with PuTTY).
  • Run the following commands:
    1. Shut down the services:
      • /opt/scripts/deploy.sh --shutdown
    2. Remove the cleanup marker file. Perform this on each node experiencing the symptoms described:
      • rm -f /var/vmware/prelude/docker/last-cleanup
    3. Reboot the appliance.
    4. Run the first-boot checks:
      • vracli status first-boot -w 1800
    5. Wait, or repeat the previous command, until it reports the following:
      • First boot complete
    6. Run the command "kubectl get node" to confirm that all nodes are in the Ready state.
    7. Run the command "docker system prune -af" on all Aria Automation nodes, one node at a time.
    8. Run the command "/opt/scripts/restore_docker_images.sh" on all Aria Automation nodes, one node at a time.
    9. Run the command "/opt/scripts/deploy.sh" on any one Aria Automation node.
    10. The deploy.sh script may again become stuck at the same status.
    11. Validate whether coredns has restarted: kubectl get pods -n kube-system
    12. Delete all three coredns pods, one after the other, with the command: kubectl delete pod -n kube-system coredns-xxxxx (a scripted version appears after this list).
    13. Validate that the pods now come up as expected.
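For step 12, the coredns pods can be recycled sequentially so that in-cluster DNS is never fully unavailable. A minimal sketch, assuming the coredns pods carry the standard upstream label k8s-app=kube-dns (verify with kubectl get pods -n kube-system --show-labels before running):

    # Delete each coredns pod in turn, waiting for a Ready replacement
    # before moving on to the next one.
    for pod in $(kubectl get pods -n kube-system -l k8s-app=kube-dns -o name); do
        kubectl delete -n kube-system "$pod"
        kubectl wait --for=condition=Ready pods -l k8s-app=kube-dns \
            -n kube-system --timeout=300s
    done

Once the coredns pods are recreated, re-run kubectl get pods -n prelude and confirm the previously stuck pods progress past Init:0/2.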