Troubleshooting Kubernetes disk pressure or disk latency in VMware Aria Automation and Automation Orchestrator 8.x

Products

VMware Aria Suite

Issue/Introduction

If VMware Aria Automation or Automation Orchestrator services are not running but the node shows it is in ready state with pods in pending, the steps in this document may alleviate the issue and bring the cluster back online.

Symptoms:

Disk Pressure

Pods are all in Pending state.
Review of kube-system pods show multiple evicted pods recreating and evicting again in a loop:
```
kubectl get pods -n kube-system
```

Disk Latency

journalctl contains errors for etcd similar to the below:

...server is likely overloaded
...failed to send out heartbeat on time (exceeded the 100ms timeout for 17.346965944s, to 9a4c6c6012cbdb5a)

The kubelet and kube-api-server services restart randomly.
/opt/scripts/deploy.sh may fail
kubectl -n prelude get pods /Nodes command randomly gives the following error :
- - Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host

Environment

Aria Automation 8.x
Aria Automation Orchestrator 8.x
VMware vRealize Automation 8.x (vRA)
VMware vRealize Orchestrator 8.x (vRO)

Cause

Disk Pressure

Either disk or memory pressure exists on one of the appliances in the cluster which will cause kube-system pods to evict. This will place the prelude pods into a pending state causing the system to become non-functional

To confirm if this is the case, review the journal with the following command:

journalctl -u kubelet

If the journal is very large, you can pipe to grep and look for entries relating to Disk Pressure or Memory Pressure. If further log reviewing is needed, add another grep via pipe and search by the date in format CCC ##" (e.g. - MAR 10):

grep -i journalctl -u kubelet | grep -i pressure

Disk Latency

Maximum Storage Latency is 20 ms for each disk IO operation from any Aria Automation node under the official product documentation System Requirements (techdocs.broadcom.com).

Resolution

Disk Pressure

Confirm that disk pressure is the issue with the steps in the "Cause" section of this document.
Verify by running vracli disk-mgr and df -i to check disk space and inode availability.
1. If disk use on the primary disk (generally SDA4) or inode utilization on the disk is above 80%, increase the size of the disk in vCenter, then at the terminal to expand the disk run:
```
vracli disk-mgr resize
```
Monitor kube-system with watch kubectl get pods -n kube-system and verify that the evictions stop and pods return to a running state. This may take several minutes.
Monitor your prelude pods with watch kubectl get pods -n prelude to confirm the prelude pods are starting. This will also take several minutes.

Disk Latency

Move the VMware Aria Automation or Automation Orchestrator appliances to Storage that can meet the Maximum Storage Latency requirements as defined by the official product documentation.

Workaround:

Pods stuck in Pending state

It is possible the pods will still stay in pending an not restart on their own. If this occurs, there are a few situations that may cause this:

If the disk was completely full on one of the nodes, it's possible that the docker images corrupted or otherwise encountered an issue.
There are problems in the kube-system pods.

If after 5-10 minutes waiting this is the case and no prelude pods have moved from pending to running, do the following:

Check for fluentd service via "systemctl status fluentd" and see if it's healthy. If this VRA is 8.1 or older, it will likely be "service fluentd status" instead. Restart the service if needed via "systemctl restart fluentd" (VRA 8.2+) or "service fluentd restart" (VRA8.1 and below).
If service is not restarting properly, run "/opt/scripts/restore-docker-images.sh" on all VRA nodes.

Once you've confirmed fluentd is in a healthy/running state, check for kube-system pods not starting:

Run
```
kubectl get pods -n kube-system
```
Check for any pods that are in non-running or completed states (e.g. "container-creating" or "error")
Run the below command to rebuild them if they are in a non-running or non-completed state and wait for this process to complete
```
kubectl delete pods -n kube-system podName
```

Once all kube-system pods are in a healthy state, you can monitor with kubectl get pods -n prelude --watch again to see if the pods start changing to running state. If the system still does not recover after several minutes, do the following:

Procedure to Restart Services

Run
```
/opt/scripts/deploy.sh --shutdown
```
Monitor pods in a separate terminal window to confirm they tear down successfully.
Run
```
/opt/scripts/deploy.sh
```

Additional Information

Impact/Risks:
VMware Aria Automation or Automation Orchestrator services become inaccessible from the web interface and you can no longer login.