VMware Aria Automation services fail to load with error "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"

Article ID: 318821

Products

VMware Aria Suite

Issue/Introduction

Symptoms:

  • The Aria Automation UI fails to load, and checking the Kubernetes node or service status returns the error
    "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"
  • In the systemd journal for etcd, nodes may fail to elect a leader:
    /health error; no leader (status code 503)
  • Action-based extensibility (ABX) actions and deployments start to fail.
  • Running kubectl get nodes fails with the error
    error: You must be logged in to the server (Unauthorized)
    Note: kubectl restarts after this error
  • Running kubectl get pods --all-namespaces fails with the error
    Unable to connect to the server: x509: certificate has expired or is not yet valid
  • Running journalctl -u kubelet contains entries similar to
    Jan 26 11:46:40 applianceFQDN.vmware.com kubelet[5669]: F0126 11:46:40.942105 5669 server.go:266] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
    Jan 26 11:46:40 applianceFQDN.vmware.com kubelet[5669]: E0126 11:46:40.941998 5669 bootstrap.go:264] Part of the existing bootstrap client certificate is expired: 2021-01-16 17:13:45 +0000 UTC
  • The kubelet service fails to come online (Active) after a node reboot or Kubernetes reinitialization:
    Status: "exit status is 255"
  • Running journalctl -xeu kubelet contains entries similar to
    Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: F0126 01:25:40.942105 5669 server.go:266] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or directory
    Aug 14 01:25:40 applianceFQDN.vmware.com kubelet[5669]: E0126 01:25:40.941998 5669 bootstrap.go:264] Part of the existing bootstrap client certificate is expired
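
To confirm whether the failure is caused by expired certificates rather than etcd corruption, the certificate dates can be checked directly with openssl. This is a minimal check, assuming the standard kubeadm and kubelet certificate locations on the appliance; the exact paths may vary between versions:

    # openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt
    # openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem

A notAfter date in the past points to the certificate expiration cause described below.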

Environment

etcd corruption:

  • VMware Aria Automation 8.x

Kubernetes certificate expiration:

  • VMware vRealize Automation/Orchestrator 8.0 through 8.4.1
  • It may also be seen in later versions in rare cases

 

Cause

The issue has at least two distinct causes:

  • The etcd cluster becomes corrupted or otherwise fails to elect a leader (a quick check is shown below).
  • The kubelet service certificate expires after one year.
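
To distinguish between the two causes, start by checking the etcd service state and journal on each node; the leader-election failure shown in the Symptoms section surfaces there. This is a quick check, assuming etcd runs as a systemd unit on the appliance (as referenced in the restore steps below):

    # systemctl status etcd
    # journalctl -u etcd --since "1 hour ago" | grep -Ei 'leader|health'

If etcd is healthy but the API server still refuses connections, use the certificate checks under Symptoms to confirm the expiration cause instead.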

Resolution

The kubelet certificate rollover is handled automatically in vRealize Automation and vRealize Orchestrator 8.2 and above.

For etcd corruption issues on a single node or clustered deployment, the instructions below still apply:

Workaround

Single VA Deployment

  1. Take a snapshot of the vRA VM.
  2. Locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root (see the example after this procedure).
  3. Reset Kubernetes by running vracli cluster leave
  4. Restore the etcd backup in /root by using the /opt/scripts/recover_etcd.sh command.
    Examples:
    8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
    8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd
 
  5. Extract the VA config from etcd with
    kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
  6. Reset Kubernetes once again using
    vracli cluster leave
  7. Install the VA config with
    kubectl apply -f /root/vaconfig.yaml --force
  8. Run vracli license to confirm that the VA config is installed properly.
  9. Run
    /opt/scripts/deploy.sh
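
As an illustration of step 2 and of a post-recovery check, the commands below show one way to pick the most recent backup and to confirm the services after deploy.sh finishes. The backup file name is a placeholder (use whatever file is actually present under /data/etcd-backup/), and the prelude namespace check assumes the standard vRA service layout:

    Before the restore (step 2):
    # ls -lht /data/etcd-backup/                 (newest backup is listed first)
    # cp /data/etcd-backup/backup-123456789.db /root/

    After deploy.sh completes (step 9):
    # kubectl get nodes                          (the node should report Ready)
    # kubectl get pods -n prelude                (all pods should reach Running)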

Clustered VAs Deployment with 3 Nodes

  1. Take snapshots of all 3 nodes.
  2. Designate one of the nodes as the primary node. On the primary node, locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root.
  3. Reset each node with
    vracli cluster leave
  4. On the primary node, restore the etcd backup copied to /root using the /opt/scripts/recover_etcd.sh command.
    Examples:
    8.0 - 8.8:  # /opt/scripts/recover_etcd.sh --confirm /root/backup-12345
    8.12 and newer:  # vracli etcd restore --local --confirm /root/backup-123456789.db; systemctl start etcd
  5. Extract the VA config from etcd with
    kubectl get vaconfig -o yaml | tee /root/vaconfig.yaml
  6. Reset the node once again with
    vracli cluster leave
  7. Install the VA config with
    kubectl apply -f /root/vaconfig.yaml --force
  8. Run vracli license to confirm that the VA config is installed properly.
    Note: vracli license is not applicable for vRO and CExP installations.
  9. Join the other 2 nodes to the cluster by running the following command on each:
    vracli cluster join [primary-node] --preservedata
  10. Run /opt/scripts/deploy.sh from the primary node, then verify the cluster as shown below.
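
Once both secondary nodes have joined and deploy.sh has finished on the primary, cluster membership and service health can be verified from the primary node. A minimal check, using only commands already referenced in this article plus the standard vRA prelude namespace:

    # kubectl get nodes                                  (all three nodes should report Ready)
    # kubectl get pods -n prelude                        (all pods should reach Running)
    # kubectl get pods --all-namespaces | grep -vE 'Running|Completed'   (should return only the header line)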

Additional Information

https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-certs/
https://github.com/kubernetes/kubeadm/issues/1753

Note:

  • This issue has been seen in environments as high as version 8.4.1.
  • kubeadm is not provided with the vRA Photon OS appliance.
  • The kubelet certificate is rotated once vracli cluster leave and the etcd restore are completed; this can be verified as shown below.
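
To confirm the rotation mentioned in the last note, the kubelet client certificate dates can be inspected after the procedure. A minimal check, assuming the default kubelet certificate directory:

    # openssl x509 -noout -dates -in /var/lib/kubelet/pki/kubelet-client-current.pem

A recent notBefore date indicates the certificate was reissued as part of the recovery.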