Disk Pressure
- Confirm that disk pressure is the cause by following the steps in the "Cause" section of this document.
- To verify, run vracli disk-mgr to check disk space and df -i to check inode availability.
- If disk usage on the primary disk (generally sda4) or its inode utilization is above 80%, increase the size of the disk in vCenter, then expand the disk by running the following command in the appliance terminal:
vracli disk-mgr resize
- Monitor the kube-system namespace with watch kubectl get pods -n kube-system and verify that the evictions stop and the pods return to the Running state. This may take several minutes.
- Monitor the prelude pods with watch kubectl get pods -n prelude to confirm that they are starting. This also takes several minutes.
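The 80% check above can be scripted. The following is a minimal sketch, not a VMware-provided tool: it reads POSIX-format df output (df -P for block usage, df -Pi for inode usage) and prints each mount point whose usage percentage exceeds a threshold. The function name over_threshold is hypothetical, and it assumes the standard POSIX df column layout (percentage in column 5, mount point in column 6).

```shell
#!/bin/sh
# Hypothetical helper: print "<mount point> <usage%>" for every filesystem
# whose usage column is above a threshold (default 80, matching the guidance).
# Input: POSIX-format df output on stdin, i.e. `df -P` or `df -Pi`,
# where column 5 is the percentage and column 6 is the mount point.
over_threshold() {
  limit="${1:-80}"
  awk -v limit="$limit" '
    NR == 1 { next }              # skip the header row
    {
      pct = $5
      sub(/%$/, "", pct)          # strip the trailing percent sign
      if (pct != "-" && pct + 0 > limit + 0) print $6 " " $5
    }
  '
}

# Usage on the appliance (review the output before resizing anything):
#   df -P  | over_threshold     # block usage
#   df -Pi | over_threshold     # inode usage
```

A mount point exactly at the threshold is not reported, since the guidance is "above 80%". Filesystems that report "-" for inode usage are skipped.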
Disk Latency
- Move the VMware Aria Automation or Automation Orchestrator appliances to storage that meets the maximum storage latency requirements defined in the official product documentation.
Workaround:
Pods stuck in Pending state
The pods may remain in the Pending state and not restart on their own. This can happen for a few reasons:
- If the disk filled up completely on one of the nodes, the Docker images may have become corrupted or otherwise damaged.
- One or more kube-system pods have problems.
If, after waiting 5-10 minutes, no prelude pods have moved from Pending to Running, do the following:
- Check whether the fluentd service is healthy with "systemctl status fluentd". On vRA 8.1 or earlier, the command is likely "service fluentd status" instead. If needed, restart the service with "systemctl restart fluentd" (vRA 8.2 and later) or "service fluentd restart" (vRA 8.1 and earlier).
- If the service does not restart properly, run "/opt/scripts/restore-docker-images.sh" on all vRA nodes.
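Because the fluentd command differs by version, a small helper can pick the right form. This is a sketch under the assumption that the version is passed as a "MAJOR.MINOR" string (for example "8.1" or "8.2.0"); fluentd_cmd is a hypothetical name, and the version-to-command mapping is taken from the step above.

```shell
#!/bin/sh
# Hypothetical helper: print the fluentd command for an action ("status" or
# "restart") on a given vRA version. Per the steps above, 8.2 and later use
# systemctl; 8.1 and earlier use the service wrapper.
fluentd_cmd() {
  action="$1"
  version="$2"
  major="${version%%.*}"          # "8.2.0" -> "8"
  rest="${version#*.}"            # "8.2.0" -> "2.0"
  minor="${rest%%.*}"             # "2.0"   -> "2"
  if [ "$major" -gt 8 ] || { [ "$major" -eq 8 ] && [ "$minor" -ge 2 ]; }; then
    echo "systemctl $action fluentd"
  else
    echo "service fluentd $action"
  fi
}

# Example: print, then run, the status command for an 8.2 appliance.
#   cmd=$(fluentd_cmd status 8.2) && echo "$cmd" && $cmd
```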
Once you have confirmed that fluentd is healthy and running, check for kube-system pods that are not starting:
- Run
kubectl get pods -n kube-system
- Look for any pods whose status is neither Running nor Completed (for example, "ContainerCreating" or "Error").
- Rebuild each such pod by deleting it with the command below, then wait for its replacement to start:
kubectl delete pods -n kube-system podName
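The pod checks in this section can be scripted against the default output of kubectl get pods --no-headers (columns NAME, READY, STATUS, RESTARTS, AGE). The sketch below is an illustration under that column assumption, not part of the documented procedure; unhealthy_pods, print_delete_cmds, and all_pods_ready are hypothetical names. It prints the delete commands instead of running them so they can be reviewed first, and all_pods_ready can be used to tell when the prelude namespace has recovered.

```shell
#!/bin/sh
# Hypothetical helpers that read `kubectl get pods --no-headers` output on
# stdin (NAME READY STATUS RESTARTS AGE).

# Print the names of pods whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Print, without running, one delete command per unhealthy pod in namespace $1.
print_delete_cmds() {
  ns="$1"
  unhealthy_pods | while read -r pod; do
    echo "kubectl delete pods -n $ns $pod"
  done
}

# Exit 0 only if at least one pod is listed and every pod is either Completed
# or Running with all containers ready (READY "n/n").
all_pods_ready() {
  awk '
    { seen = 1 }
    $3 == "Completed" { next }
    {
      split($2, r, "/")
      if ($3 != "Running" || r[1] != r[2]) bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Usage:
#   kubectl get pods -n kube-system --no-headers | print_delete_cmds kube-system
#   kubectl get pods -n prelude --no-headers | all_pods_ready && echo "prelude ready"
```

unhealthy_pods looks only at the STATUS column, so a pod that reports Running with READY 0/1 is not flagged; all_pods_ready does catch that case.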
Once all kube-system pods are healthy, monitor the prelude pods again with kubectl get pods -n prelude --watch to see whether they move to the Running state. If the system still does not recover after several minutes, do the following:
Procedure to Restart Services
- Run
/opt/scripts/deploy.sh --shutdown
- Monitor the pods in a separate terminal window to confirm that they shut down successfully.
- Run
/opt/scripts/deploy.sh
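The restart procedure can also be run as a single script that waits for the teardown instead of watching it in a second terminal. This is a minimal sketch, not a supported VMware script: it assumes that a completed shutdown leaves no pods in the prelude namespace, and the 30-second poll interval and 40-attempt limit are arbitrary choices. restart_services and no_pods_left are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch of the restart procedure: shut down, poll until the
# prelude namespace has no pods, then redeploy. Assumes a finished shutdown
# leaves no prelude pods; the 30 s interval and 40-try limit are arbitrary.

# Exit 0 when stdin contains no non-empty lines (no pods listed).
no_pods_left() {
  ! grep -q .
}

restart_services() {
  /opt/scripts/deploy.sh --shutdown || return 1
  tries=0
  while :; do
    # Only trust an empty pod list when kubectl itself succeeded.
    if pods=$(kubectl get pods -n prelude --no-headers 2>/dev/null) &&
       printf '%s' "$pods" | no_pods_left; then
      break
    fi
    tries=$((tries + 1))
    if [ "$tries" -ge 40 ]; then
      echo "prelude pods did not tear down; check them manually" >&2
      return 1
    fi
    sleep 30
  done
  /opt/scripts/deploy.sh
}
```

To use it, source the file in a root shell on the appliance and call restart_services. Checking kubectl's own exit status keeps a failed API call from being mistaken for a completed teardown.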