VMware Aria Automation services fail to load with error "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"

Article ID: 318821

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • The Aria Automation UI fails to load, and checking the Kubernetes node or service status returns the error
    "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"
  • In the systemd journal for etcd, nodes may fail to elect a leader:
    /health error; no leader (status code 503)
  • Action-based extensibility (ABX) actions and deployments start to fail.
  • Running kubectl get nodes fails with the error
    error: You must be logged in to the server (Unauthorized)
    Note: kubectl restarts after this error
  • Running kubectl get pods --all-namespaces fails with the error
    Unable to connect to the server: x509: certificate has expired or is not yet valid
  • Running journalctl -u kubelet contains entries similar to
    Jan 26 11:46:40 applianceFQDN.vmware.com kubelet[5669]: F0126 11:46:40.942105 5669 server.go:266] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
    Jan 26 11:46:40 applianceFQDN.vmware.com kubelet[5669]: E0126 11:46:40.941998 5669 bootstrap.go:264] Part of the existing bootstrap client certificate is expired: 2021-01-16 17:13:45 +0000 UTC
  • The kubelet service fails to come online (Active) after a node reboot or Kubernetes reinitialization:
    Status: "exit status is 255"
  • Running journalctl -xeu kubelet contains entries similar to
    Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: F0126 01:25:40.942105 5669 server.go:266] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
    Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: E0126 01:25:40.941998 5669 bootstrap.go:264] Part of the existing bootstrap client certificate is expired
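
To confirm whether the failure is caused by expired certificates rather than etcd corruption, the certificate dates can be checked directly with openssl. This is a minimal check, assuming the standard kubeadm and kubelet certificate locations on the appliance; the exact paths may vary between versions:

    # openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt
    # openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem

A notAfter date in the past points to the certificate expiration cause described below.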

Environment

etcd corruption:

  • VMware Aria Automation 8.x

Kubernetes certificate expiration:

  • VMware vRealize Automation/Orchestrator 8.0 through 8.4.1
  • It may also be seen in later versions in rare cases

 

Cause

The issue has at least two distinct causes:

  • The etcd cluster becomes corrupted or otherwise fails to elect a leader (a quick check is shown below).
  • The kubelet service certificate expires after one year.
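
To distinguish between the two causes, start by checking the etcd service state and journal on each node; the leader-election failure shown in the Symptoms section surfaces there. This is a quick check, assuming etcd runs as a systemd unit on the appliance (as referenced in the restore steps below):

    # systemctl status etcd
    # journalctl -u etcd --since "1 hour ago" | grep -Ei 'leader|health'

If etcd is healthy but the API server still refuses connections, use the certificate checks under Symptoms to confirm the expiration cause instead.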

Resolution

The kubelet certificate rollover is handled automatically in vRealize Automation and vRealize Orchestrator 8.2 and above.

For etcd corruption issues on a single node or clustered deployment, the instructions below still apply:

Workaround

Single VA Deployment

  1. Take a snapshot of the vRA VM.
  2. Locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root (see the example after this procedure).
  3. Reset Kubernetes by running vracli cluster leave
  4. Restore the etcd backup in /root by using the /opt/scripts/recover_etcd.sh command.
    Examples:
    8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
    8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd
 
  5. Extract the VA config from etcd with
    kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
  6. Reset Kubernetes once again using
    vracli cluster leave
  7. Install the VA config with
    kubectl apply -f /root/vaconfig.yaml --force
  8. Run vracli license to confirm that the VA config is installed properly.
  9. Run
    /opt/scripts/deploy.sh
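
As an illustration of step 2 and of a post-recovery check, the commands below show one way to pick the most recent backup and to confirm the services after deploy.sh finishes. The backup file name is a placeholder (use whatever file is actually present under /data/etcd-backup/), and the prelude namespace check assumes the standard vRA service layout:

    Before the restore (step 2):
    # ls -lht /data/etcd-backup/                 (newest backup is listed first)
    # cp /data/etcd-backup/backup-123456789.db /root/

    After deploy.sh completes (step 9):
    # kubectl get nodes                          (the node should report Ready)
    # kubectl get pods -n prelude                (all pods should reach Running)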

Clustered VAs Deployment with 3 Nodes

  1. Take snapshots of all 3 nodes.
  2. Designate one of the nodes as the primary node. On the primary node, locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root.
  3. Reset each node with
    vracli cluster leave
  4. On the primary node, restore the etcd backup copied to /root using the /opt/scripts/recover_etcd.sh command.
    Examples:
    8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
    8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd
  5. Extract the VA config from etcd with
    kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
  6. Reset the node once again with
    vracli cluster leave
  7. Install the VA config with
    kubectl apply -f /root/vaconfig.yaml --force
  8. Run vracli license to confirm that the VA config is installed properly.
    Note: vracli license is not applicable for vRO and CExP installations.
  9. Join the other 2 nodes to the cluster by running the following command on each:
    vracli cluster join [primary-node] --preservedata
  10. Run /opt/scripts/deploy.sh from the primary node, then verify the cluster as shown below.
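
Once both secondary nodes have joined and deploy.sh has finished on the primary, cluster membership and service health can be verified from the primary node. A minimal check, using only commands already referenced in this article plus the standard vRA prelude namespace:

    # kubectl get nodes                                  (all three nodes should report Ready)
    # kubectl get pods -n prelude                        (all pods should reach Running)
    # kubectl get pods --all-namespaces | grep -vE 'Running|Completed'   (should return only the header line)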

Additional Information

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/
https://github.com/kubernetes/kubeadm/issues/1753

Note:

  • This issue has been seen in environments as high as version 8.4.1.
  • kubeadm is not provided with the vRA Photon OS appliance.
  • The kubelet certificate is rotated once vracli cluster leave and the etcd restore are completed; this can be verified as shown below.
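
To confirm the rotation mentioned in the last note, the kubelet client certificate dates can be inspected after the procedure. A minimal check, assuming the default kubelet certificate directory:

    # openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem

A recent notBefore date indicates the certificate was reissued as part of the recovery.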