Symptoms:
Running the nodes-ready health check fails with:
make: *** [/opt/health/Makefile:56: nodes-ready] Error 1
<HOSTNAME> python3[99629]: [vracli] [DEBUG] executing bash on command-executor-xxxxx failed. Error: [: Command '['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']' returned non-zero exit status 1.].Failed command: [['/usr/local/bin/kubectl', 'exec', '--namespace', 'kube-system', 'command-executor-xxxxx', '--', 'run-on-execd', '--', 'bash', '-c', '/opt/scripts/mon-fips.sh']].Exit code: [1]. Stderr: [error: unable to upgrade connection: Authorization error (user=kube-apiserver-kubelet-client, verb=create, resource=nodes, subresource=proxy)].
Error: "MountVolume.SetUp failed for volume \"default-token-fcd6p\" (UniqueName: \"kubernetes.io/secret/<UUID>-default-token-xxxxx\") pod \"pipeline-ui-app-xxxxxxxxxx-xxxxx\" (UID: \"<UUID>\") : Get \"https://vra-k8s.local:6443/api/v1/namespaces/prelude/secrets/default-token-xxxxx\": dial tcp: lookup vra-k8s.local: Temporary failure in name resolution"
kubectl -n prelude get pods
command gives the following error:
Unable to connect to the server: dial tcp: lookup vra-k8s.local on <DNS-IP>:53: no such host
vracli
commands give the error:
[ERROR] HTTPSConnectionPool(host='vra-k8s.local', port=6443): Max retries exceeded with url: /apis/apiextensions.k8s.io/v1/customresourcedefinitions/vaconfigs.prelude.vmware.com (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x4321abcde123>: Failed to establish a new connection: [Errno -2] Name or service not known'))
VMware Aria Automation 8.x
Cause:
The issue is caused by a race condition between the VAMI network settings boot-up scripts and the custom logic used to configure Kubernetes to work with the CoreDNS service. In rare cases the two services can attempt to update the /etc/hosts file at the same time, which can blank the contents of the file.
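A quick way to confirm this failure mode is to check whether /etc/hosts has been blanked or has lost the vra-k8s.local entry. The following is a diagnostic sketch only; the HOSTS_FILE parameter is introduced here for illustration so the check can also be dry-run against a copy of the file:

```shell
# Diagnostic sketch: check whether /etc/hosts was blanked by the race condition.
# HOSTS_FILE is an illustrative parameter (defaults to the real file).
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"

if [ ! -s "$HOSTS_FILE" ]; then
    echo "$HOSTS_FILE is empty - consistent with the race condition described above"
elif grep -q 'vra-k8s.local' "$HOSTS_FILE"; then
    echo "vra-k8s.local entry present in $HOSTS_FILE"
else
    echo "vra-k8s.local entry missing from $HOSTS_FILE"
fi
```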
Note: These symptoms can be seen in environments that have experienced unexpected outages due to storage or network unavailability to the underlying Photon OS operating system. Although a fix released in versions 8.14.x and later prevents service-level corruption of this file, the issue may still be seen in versions as high as 8.18.1.
Resolution:
This issue is fixed in VMware Aria Automation 8.14 and later. See the note in the Cause section for additional details if you are seeing these symptoms. The workaround below may still be used.
Workaround:
If /etc/hosts has been blanked, restore its contents and then restart the kubelet:
systemctl restart kubelet
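The workaround can be sketched as follows. This is a hedged, dry-run illustration, not the official procedure: the NODE_IP placeholder and the assumption that vra-k8s.local should resolve to the appliance's own IP are introduced here; on a real appliance, verify the correct entry against a healthy node before editing /etc/hosts:

```shell
# Workaround sketch, assuming vra-k8s.local should resolve to the appliance's own IP.
# Dry run: it reports what to restore rather than editing the file directly.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"
NODE_IP="${NODE_IP:-<NODE-IP>}"   # placeholder: substitute the appliance IP

if ! grep -q 'vra-k8s.local' "$HOSTS_FILE" 2>/dev/null; then
    echo "restore entry: $NODE_IP vra-k8s.local  (append to $HOSTS_FILE as root)"
fi

# Then restart kubelet so the API server address resolves again:
systemctl restart kubelet 2>/dev/null || echo "run 'systemctl restart kubelet' as root"
```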