The Supervisor cluster Kubernetes API switches from available to unavailable after a Supervisor upgrade to "v1.30.10+vmware.1-fips-vsc0.1.12-24799161" because the liveness probe on the etcd pods fails.
LAST SEEN   TYPE   REASON   OBJECT   MESSAGE
2m24s (x7942 over 22h)   Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
62s (x7792 over 21h)   Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
7s (x7882 over 21h)   Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
The vCenter UI shows errors like the following under Workload Management > Supervisor > Configuring:
Installed and Started Kubernetes Node Agent on the ESXi Host
A general system error occurred. Error message: Get "http://localhost:1080/external-cert/http1/#.#.#.#:6443/api/v1/nodes?fieldSelector=metadata.name%<esxi-hostname>.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers).
kubectl commands fail with API errors:
#kubectl get nodes
E1202 398062 1484259 memcache.go:265) couldn't get current server api group list: "https://127.0.0.1:6443/api?timeout=3s": dial tcp 127.0.0.1:6443: Connection refused
vCenter: 8.0U3
Supervisor: v1.30.10+vmware.1-fips-vsc0.1.12-24799161
The liveness probe on the etcd pods fails because the livenessProbe command in the etcd manifest (/etc/kubernetes/manifests/etcd.yaml on each Supervisor control plane VM) uses an invalid format:
    env:
    - name: ETCD_ENABLE_V2
      value: "true"
    image: etcd:v3.5.21_vmware.1-fips
    imagePullPolicy: Never
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get --consistency="s" foo
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15
In etcd 3.5.7, the /bin/sh binary was removed from the container image, so a livenessProbe that invokes /bin/sh can no longer be executed.
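To confirm that a manifest matches this cause, it can help to check whether the livenessProbe exec command starts with a shell. The following is a minimal diagnostic sketch, not a fix and not a VMware tool: the function names `liveness_probe_command` and `probe_uses_shell` are hypothetical, the sample text is a shortened copy of the manifest fragment above, and the parser is a simple indentation scan that assumes the manifest is laid out like that fragment.

```python
from typing import List

# Shortened sample of the livenessProbe section shown above (illustrative only).
SAMPLE = """\
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl get foo
      failureThreshold: 8
"""


def liveness_probe_command(manifest_text: str) -> List[str]:
    """Return the items listed under 'command:' in the first livenessProbe block."""
    command: List[str] = []
    probe_indent = None  # indentation of the 'livenessProbe:' key, once found
    in_command = False
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if probe_indent is None:
            if stripped == "livenessProbe:":
                probe_indent = indent
            continue
        # A line at the same or lower indentation ends the probe block.
        if indent <= probe_indent:
            break
        if stripped == "command:":
            in_command = True
            continue
        if in_command:
            if stripped.startswith("- "):
                command.append(stripped[2:].strip())
            else:
                in_command = False
    return command


def probe_uses_shell(manifest_text: str) -> bool:
    """True if the livenessProbe exec command starts with a shell binary."""
    command = liveness_probe_command(manifest_text)
    return bool(command) and command[0] in ("/bin/sh", "sh")


print(probe_uses_shell(SAMPLE))
```

To check a real control plane VM, the same function could be given the contents of /etc/kubernetes/manifests/etcd.yaml read from disk. A result of True only indicates that the probe depends on a shell; it does not by itself confirm that the probe is failing.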
Please contact Global Support if you encounter this issue.
Note that the change applied to resolve this issue will be reverted whenever the Supervisor control plane VMs are recreated, for example during a Supervisor cluster upgrade, and may need to be re-applied.