Aria Automation kubectl & vracli commands cannot find vra-k8s.local:6443 and the docker service cannot be started (failed)

Article ID: 432577

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Symptoms:

vracli and kubectl commands return errors when contacting the local Kubernetes cluster on port 6443
- Error: "The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?"
Running kubectl -n prelude get pods -o wide on another node may show 3 pods scheduled for each service, but the affected node's pods show 0/# containers started and a node name of <none>
- This implies that the node became unavailable to Kubernetes while pods were running on it
The Docker service remains in a failed state after every attempt to restart it
- systemctl status docker
- systemctl restart docker
The Docker service may show an error such as: failed to start daemon: failed to dial "/run/containerd/containerd.sock": unknown service containerd.services.namespaces.v1.Namespaces: not implemented
- dockerd --debug
- journalctl -xeu docker
Containerd is started and running
- systemctl status containerd
/etc/hosts file looks fine on this node
df -h shows good disk space on all filesystems
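
The two error signatures quoted in the symptoms above can be matched mechanically when reviewing command output or logs. The following is a minimal sketch, not a supported tool: the function name classify_symptom and its output labels are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Aria Automation): compare one line of
# output against the two error signatures quoted in the Symptoms list.
classify_symptom() {
    case "$1" in
        *"vra-k8s.local:6443 was refused"*)
            # kubectl / vracli could not reach the Kubernetes API endpoint
            echo "k8s-api-refused" ;;
        *"containerd.services.namespaces.v1.Namespaces: not implemented"*)
            # dockerd could not use the containerd socket
            echo "docker-containerd-namespaces" ;;
        *)
            echo "no-match" ;;
    esac
}

# Example usage on the affected node (assumes journalctl is available):
#   journalctl -u docker --no-pager | while IFS= read -r line; do
#       classify_symptom "$line"
#   done | sort | uniq -c
```

A match on both signatures, together with containerd running and healthy disk space, is consistent with the situation this article describes.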

Environment

VMware Aria Automation 8.x

Cause

An issue with the Docker service prevents it from starting on this node.

Resolution

As a precaution, take a simultaneous non-memory snapshot of all Automation nodes before proceeding. If the snapshot task fails in Aria Lifecycle, take the snapshots in vSphere instead.

If only one node in the cluster is affected, reboot the affected node only.
There are two ways to do this:

In vSphere, for the affected node ONLY, Actions > Power > Restart Guest OS
- or -
In an SSH session on the affected node ONLY, run the command:
- reboot

If there is no obvious configuration issue on the system, Docker can be expected to start successfully after the reboot.
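
After the reboot, the Docker service state and the pod status can be rechecked with the same commands used during diagnosis. The sketch below is one possible way to wait for Docker before listing the pods. The function name wait_until, the attempt count, and the delay are assumptions chosen for illustration and are not values documented for Aria Automation.

```shell
#!/bin/sh
# Hypothetical helper: retry a check command until it succeeds or the
# attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Succeed as soon as the check command exits with status 0
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the final one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example usage on the rebooted node (30 attempts, 10 seconds apart):
#   wait_until 30 10 systemctl is-active --quiet docker \
#       && kubectl -n prelude get pods -o wide
```

If the wait times out, systemctl status docker and journalctl -xeu docker show whether the original error has returned.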

If all nodes in the cluster are affected, restart cluster services with /opt/scripts/deploy.sh. Expect about 30 minutes of downtime.
