Aria Automation fails to start with blank /etc/hosts file contents

Article ID: 314799

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Following a reboot, Aria Automation fails to start.

  • The Aria Automation appliance VM console consistently reports the error: [FAILED] Failed to start kube-proxy.service.

  • The health check output when running /opt/scripts/deploy.sh fails on the nodes-ready test:
    • Running check nodes-ready
      make: *** [/opt/health/Makefile:56: nodes-ready] Error 1
  • Running the kubectl get nodes command reveals that one or two Aria Automation nodes are currently in a "NotReady" state.

  • The /var/log/deploy.log contains errors similar to:
    • <HOSTNAME> python3[99629]: [vracli] [DEBUG] executing bash on command-executor-xxxxx failed. Error: [: Command '['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']' returned non-zero exit status 1.].Failed command: [['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']].Exit code: [1]. Stderr: [error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)].
  • The systemd.journal log located under /var/services-logs/journal/ contains errors similar to:
    • Error: "MountVolume.SetUp failed for volume \"default-token-fcd6p\" (UniqueName: \"kubernetes.io/secret/<UUID>-default-token-xxxxx\") pod \"pipeline-ui-app-xxxxxxxxxx-xxxxx\" (UID: \"<UUID>\") : Get \"https://vra-k8s.local:6443/api/v1/namespaces/prelude/secrets/default-token-xxxxx\": dial tcp: lookup vra-k8s.local: Temporary failure in name resolution"
  • The kubectl -n prelude get pods command returns the following error:
    • Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host
  • vracli commands return errors similar to:
    • [ERROR] HTTPSConnectionPool(host='vra-k8s.local', port=6443): Max retries exceeded with url: /apis/apiextensions.k8s.io/v1/customresourcedefinitions/vaconfigs.prelude.vmware.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x4321abcde123>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    • socket.gaierror: [Errno -3] Temporary failure in name resolution
  • The /etc/hosts file is blank or missing content on one or more of the Aria Automation appliances (see the verification sketch after this list).

  • Aria Automation becomes inaccessible via the UI after migrating from one vCenter to another.

  • The Aria Automation cluster fails to restart, either from Aria Suite Lifecycle (LCM) or by running "deploy.sh --shutdown", because a node ends up in a NotReady state. The node shows as tainted per the KB article Aria Automation nodes in not ready state and deploy.sh fails, but the taint returns after applying the resolution from that article.
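
To verify these symptoms on an appliance, a minimal check (a sketch only, run as root on each node) is to inspect /etc/hosts directly and query node status. Note that the kubectl call may itself fail with the name-resolution errors shown above when run on an affected node:

    # Check whether /etc/hosts has been blanked on this appliance
    cat /etc/hosts

    # Check node readiness; this may fail on an affected node because
    # vra-k8s.local cannot be resolved without its /etc/hosts entry
    kubectl get nodes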

Environment

VMware Aria Automation 8.x

Cause

The issue is caused by a race condition between the VAMI network settings boot-up scripts and the custom logic used to configure Kubernetes to work with the CoreDNS service. In rare cases the two services can attempt to update the /etc/hosts file at the same time, which can blank the contents of the file.

Note: These symptoms can be seen in environments that have experienced unexpected outages due to storage or network unavailability to the underlying Photon OS operating system. While a fix was released in versions 8.14.x and later to prevent service-level corruption of this file, the issue may still be observed in versions as high as 8.18.1.
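
As a simplified illustration only (not the actual VMware scripts), a non-atomic "truncate then rewrite" pattern is vulnerable to exactly this kind of blanking if a second writer runs in the window between the truncate and the rewrite. The file name below is a scratch file used purely for illustration:

    # Hypothetical illustration with a scratch file: the file is empty
    # between the truncate and the rewrite, so a second writer running
    # in that window can observe or preserve a blank file
    > /tmp/hosts.example
    printf '127.0.0.1 localhost\n127.0.0.1 vra-k8s.local\n' >> /tmp/hosts.example

    # A safer pattern writes a temporary file and renames it into place
    # atomically on the same filesystem
    printf '127.0.0.1 localhost\n127.0.0.1 vra-k8s.local\n' > /tmp/hosts.example.tmp
    mv /tmp/hosts.example.tmp /tmp/hosts.example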

Resolution

This issue is fixed in VMware Aria Automation 8.14 and later. See the note in the Cause section for additional details if you are still seeing these symptoms. The workaround steps below may still be used.

  • Copy the /etc/hosts file entries from a functioning node in the cluster and update the file on each affected node.
    See the example /etc/hosts entries in the note below; a sketch of one way to copy the file from a healthy node follows the note.

 

NOTE: In the case of a single-node Aria Automation deployment, the /etc/hosts file can be updated with the entries below:

127.0.0.1 localhost

127.0.0.1 vra-k8s.local
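
For clustered deployments, one way to update an affected node is to pull the file from a healthy node over SSH. This is a minimal sketch, assuming root SSH access between the appliances; the hostname vra-node-1.example.com is hypothetical, and any node-specific lines (such as the appliance's own FQDN) should be reviewed and adjusted after copying:

    # On the affected appliance: back up the blank file, then copy the
    # entries from a known-good node and review them before proceeding
    cp /etc/hosts /etc/hosts.bak
    scp root@vra-node-1.example.com:/etc/hosts /etc/hosts
    cat /etc/hosts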

  • Restart the kubelet service on each affected node:
    systemctl restart kubelet
  • Trigger a redeployment of the pods to ensure all services are healthy:

    /opt/scripts/deploy.sh
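
After the kubelet restart and the redeployment complete, the nodes and pods can be checked again. As a rough guide, kubectl get nodes should report every node as Ready, and kubectl -n prelude get pods should show the pods as Running once the services have fully started:

    kubectl get nodes
    kubectl -n prelude get pods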