Aria Automation fails to start with blank /etc/hosts file contents

Article ID: 314799

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Following a reboot, Aria Automation fails to start.

  • The Aria Automation appliance VM console consistently reports the error: [FAILED] Failed to start kube-proxy.service.

  • The health check output when running /opt/scripts/deploy.sh fails on the nodes-ready test:
    • Running check nodes-ready
      make: *** [/opt/health/Makefile:56: nodes-ready] Error 1
  • Running the kubectl get nodes command reveals that one or two Aria Automation nodes are currently in a "NotReady" state.

  • The /var/log/deploy.log contains errors similar to:
    • <HOSTNAME> python3[99629]: [vracli] [DEBUG] executing bash on command-executor-xxxxx failed. Error: [: Command '['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']' returned non-zero exit status 1.].Failed command: [['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']].Exit code: [1]. Stderr: [error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)].
  • The systemd.journal log located under /var/services-logs/journal/ contains errors similar to:
    • Error: "MountVolume.SetUp failed for volume \"default-token-fcd6p\" (UniqueName: \"kubernetes.io/secret/<UUID>-default-token-xxxxx\") pod \"pipeline-ui-app-xxxxxxxxxx-xxxxx\" (UID: \"<UUID>\") : Get \"https://vra-k8s.local:6443/api/v1/namespaces/prelude/secrets/default-token-xxxxx\": dial tcp: lookup vra-k8s.local: Temporary failure in name resolution"
  • The kubectl -n prelude get pods command returns the following error:
    • Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host
  • vracli commands return errors similar to:
    • [ERROR] HTTPSConnectionPool(host='vra-k8s.local', port=6443): Max retries exceeded with url: /apis/apiextensions.k8s.io/v1/customresourcedefinitions/vaconfigs.prelude.vmware.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x4321abcde123>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    • socket.gaierror: [Errno -3] Temporary failure in name resolution
  • The /etc/hosts file is blank or missing content on one or more of the Aria Automation appliances (see the verification sketch after this list).

  • Aria Automation becomes inaccessible via the UI after migrating from one vCenter to another.

  • The Aria Automation cluster fails to restart, either from Aria Suite Lifecycle (LCM) or by running "deploy.sh --shutdown", because a node ends up in a NotReady state. The node shows as tainted per the KB article Aria Automation nodes in not ready state and deploy.sh fails, but the taint returns after applying the resolution from that article.
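
To verify these symptoms on an appliance, a minimal check (a sketch only, run as root on each node) is to inspect /etc/hosts directly and query node status. Note that the kubectl call may itself fail with the name-resolution errors shown above when run on an affected node:

    # Check whether /etc/hosts has been blanked on this appliance
    cat /etc/hosts

    # Check node readiness; this may fail on an affected node because
    # vra-k8s.local cannot be resolved without its /etc/hosts entry
    kubectl get nodes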

Environment

VMware Aria Automation 8.x

Cause

The issue is caused by a race condition between the VAMI network settings boot-up scripts and the custom logic used to configure Kubernetes to work with the CoreDNS service. In rare cases the two services can attempt to update the /etc/hosts file at the same time, which can blank the contents of the file.

Note: These symptoms can be seen in environments that have experienced unexpected outages due to storage or network unavailability to the underlying Photon OS operating system. While a fix was released in versions 8.14.x and later to prevent service-level corruption of this file, the issue may still be observed in versions as high as 8.18.1.
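
As a simplified illustration only (not the actual VMware scripts), a non-atomic "truncate then rewrite" pattern is vulnerable to exactly this kind of blanking if a second writer runs in the window between the truncate and the rewrite. The file name below is a scratch file used purely for illustration:

    # Hypothetical illustration with a scratch file: the file is empty
    # between the truncate and the rewrite, so a second writer running
    # in that window can observe or preserve a blank file
    > /tmp/hosts.example
    printf '127.0.0.1 localhost\n127.0.0.1 vra-k8s.local\n' >> /tmp/hosts.example

    # A safer pattern writes a temporary file and renames it into place
    # atomically on the same filesystem
    printf '127.0.0.1 localhost\n127.0.0.1 vra-k8s.local\n' > /tmp/hosts.example.tmp
    mv /tmp/hosts.example.tmp /tmp/hosts.example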

Resolution

This issue is fixed in VMware Aria Automation 8.14 and later. See the note in the Cause section for additional details if you are still seeing these symptoms. The workaround steps below may still be used.

  • Copy the /etc/hosts file entries from a functioning node in the cluster and update the file on each affected node.
    See the example /etc/hosts entries in the note below; a sketch of one way to copy the file from a healthy node follows the note.

 

NOTE: In the case of a single-node Aria Automation deployment, the /etc/hosts file can be updated with the entries below:

127.0.0.1 localhost

127.0.0.1 vra-k8s.local
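
For clustered deployments, one way to update an affected node is to pull the file from a healthy node over SSH. This is a minimal sketch, assuming root SSH access between the appliances; the hostname vra-node-1.example.com is hypothetical, and any node-specific lines (such as the appliance's own FQDN) should be reviewed and adjusted after copying:

    # On the affected appliance: back up the blank file, then copy the
    # entries from a known-good node and review them before proceeding
    cp /etc/hosts /etc/hosts.bak
    scp root@vra-node-1.example.com:/etc/hosts /etc/hosts
    cat /etc/hosts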

  • Restart the kubelet service on each affected node:
    systemctl restart kubelet
  • Trigger a redeployment of the pods to ensure all services are healthy:

    /opt/scripts/deploy.sh
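
After the kubelet restart and the redeployment complete, the nodes and pods can be checked again. As a rough guide, kubectl get nodes should report every node as Ready, and kubectl -n prelude get pods should show the pods as Running once the services have fully started:

    kubectl get nodes
    kubectl -n prelude get pods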