VMware Aria Automation services fail to load with error "The connection to the server vra-k8s.local:6443 was refused

search cancel

VMware Aria Automation services fail to load with error "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"

book

Article ID: 318821

calendar_today

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Aria Automation UI fails to load and upon checking the Kubernetes node or service status returns below error.

"The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"

In the systemd journal for etcd, nodes may fail to elect a leader:
```
/health error; no leader (status code 503)
```
Action-based extensibility (ABX) actions and deployments start to fail.

Running kubectl get nodes fails with the error

error: You must be logged in to the server (Unauthorized)

Note: kubectl restarts after this error

Running kubectl get pods --all-namespaces fails with the error

Unable to connect to the server: x509: certificate has expired or is not yet valid

Kubelet service fails to come online (Active) after the node reboot or K8s reinitialize:
```
Status: "exit status is 255"
```

Running journalctl -xeu kubelet contains entries similar to

Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: F0126 01:25:40.942105 5669 server.go:266] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: E0126 01:25:40.941998 5669 bootstrap.go:264] Part of the existing bootstrap client certificate is expired

Environment

VMware Aria Automation 8.x

Cause

The issue has at least two distinct causes:

It may be caused by the etcd cluster becoming corrupted or otherwise failing to elect a leader
It may be caused by the kubelet service certificate expiring after one year.

Resolution

The kubelet certificate rollover is automatically handled in vRealize Automation & vRealize Orchestrator 8.2 and above.

For etcd corruption issues on a single node or cluster, these instructions are still valid and the same:

Workaround

Single VA Deployment

Take a snapshot of the vRA VM.
Locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root.
Reset Kubernetes by running vracli cluster leave.
Restore the etcd backup in /root by using the /opt/scripts/recover_etcd.sh command.

Examples:

8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd

Extract VA config from etcd with:

kubectl get vaconfig -o yaml | tee > /root/vaconfig.yaml

Reset Kubernetes once again using:
```
vracli cluster leave
```

Run to Install the VA config:

kubectl apply -f /root/vaconfig.yaml --force

Run vracli license to confirm that VA config is installed properly.
Run:
```
/opt/scripts/deploy.sh
```

Clustered VAs Deployment with 3 Nodes

Take snapshots of all 3 nodes.
Let's call one of the nodes a primary node. On the primary node, locate an etcd backup at /data/etcd-backup/ and preserved in /root.
Reset each node with:
```
vracli cluster leave
```
On the primary node, restore the etcd backup taken at /root using the /opt/scripts/recover_etcd.sh command.

Examples:

8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd

Extract VA config from etcd with:

kubectl get vaconfig -o yaml | tee > /root/vaconfig.yaml

Reset the node once again with:
```
vracli cluster leave
```

Install VA config with:

kubectl apply -f /root/vaconfig.yaml --force

Run vracli license to confirm that VA config is installed properly.

Note: vracli license is not applicable for vRO and CExP installations.

Join the other 2 nodes in the cluster by running the following command on each:
```
vracli cluster join [primary-node] --preservedata
```
Run /opt/scripts/deploy.sh from the primary node.

Additional Information

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/
https://github.com/kubernetes/kubeadm/issues/1753

Note:

This issue has been seen in environments as high as version 8.4.1.
Kubeadm is not provided with the vRA PhotonOS appliance.
The kubelet cert is rotated when the vracli cluster leave and etcd restore are completed.

Feedback

thumb_up Yes

thumb_down No