Liveness probe failed on etcd pods after Supervisor upgrade to v1.30.10 impacting kube-api



Article ID: 411199


Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

After a Supervisor upgrade to "v1.30.10+vmware.1-fips-vsc0.1.12-24799161", the Supervisor cluster Kubernetes API alternates between available and unavailable. The etcd pods report repeated liveness probe failures, as shown in the following events:

LAST SEEN TYPE REASON OBJECT MESSAGE
2m24s (x7942 over 22h) Warning Unhealthy Pod/etcd-############################## (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
62s (x7792 over 21h) Warning Unhealthy Pod/etcd-############################## (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
7s (x7882 over 21h) Warning Unhealthy Pod/etcd-############################## (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
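When many events have to be triaged, the signature in the messages above can be matched programmatically. The following is a minimal Python sketch, not an official tool: the function name `missing_probe_binary` and the regular expression are illustrative assumptions derived only from the event text quoted in this article.

```python
import re

# Pattern derived from the event text quoted in this article: the container
# runtime cannot exec the probe command because the binary does not exist
# in the container image.
MISSING_BINARY = re.compile(
    r'Liveness probe errored:.*exec: "(?P<binary>[^"]+)": '
    r'stat (?P=binary): no such file or directory'
)


def missing_probe_binary(event_message):
    """Return the binary the probe failed to exec, or None if no match."""
    match = MISSING_BINARY.search(event_message)
    return match.group("binary") if match else None
```

A return value of `/bin/sh` for an etcd pod event corresponds to the condition described in this article; `None` means the event has a different cause.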

The vCenter UI shows errors like the following under Workload Management >> Supervisor >> Configuring:

Installed and Started Kubernetes Node Agent on the ESXi Host
A general system error occurred. Error message: Get "http://localhost:1080/external-cert/http1/#.#.#.#:6443/api/v1/nodes?fieldSelector=metadata.name%<esxi-hostname>.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers).

kubectl commands fail with API errors:

# kubectl get nodes

E1202 398062 1484259 memcache.go:265) couldn't get current server api group list: "https://127.0.0.1:6443/api?timeout=3s": dial tcp 127.0.0.1:6443: Connection refused

Environment

vCenter: 8.0U3

Supervisor: v1.30.10+vmware.1-fips-vsc0.1.12-24799161

Cause

The liveness probe on the etcd pods fails because the livenessProbe command in the etcd manifest, /etc/kubernetes/manifests/etcd.yaml on each Supervisor control plane VM, is not valid for this etcd image:

env:
- name: ETCD_ENABLE_V2
  value: "true"
image: etcd:v3.5.21_vmware.1-fips
imagePullPolicy: Never
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -ec
    - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
      --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
      get --consistency="s" foo
  failureThreshold: 8
  initialDelaySeconds: 15
  timeoutSeconds: 15

In etcd 3.5.7, the /bin/sh binary was removed from the container image, so a livenessProbe that invokes /bin/sh can no longer be executed. This matches the "stat /bin/sh: no such file or directory" message in the pod events.
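To confirm that a given manifest is affected, the probe command can be inspected for a leading /bin/sh. The following is a minimal, read-only Python sketch under stated assumptions: the helper names `probe_command` and `probe_uses_shell` and the `SAMPLE` text are hypothetical, and the parser is deliberately line-based (no YAML library), so it assumes the block-style indentation of the on-disk manifest is intact. It does not modify anything.

```python
def probe_command(manifest_text, probe="livenessProbe"):
    """Return the exec command list of the first `probe` block found."""
    command = []
    in_probe = in_command = False
    probe_indent = command_indent = 0
    for line in manifest_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if not in_probe:
            if stripped == probe + ":":
                in_probe, probe_indent = True, indent
            continue
        if indent <= probe_indent:
            break  # left the probe block
        if stripped == "command:":
            in_command, command_indent = True, indent
            continue
        if in_command:
            if stripped.startswith("- ") and indent >= command_indent:
                command.append(stripped[2:])  # new list item
            elif command and indent > command_indent:
                command[-1] += " " + stripped  # wrapped continuation line
            else:
                in_command = False  # e.g. failureThreshold ends the list
    return command


def probe_uses_shell(command):
    """True if the probe's first argument is /bin/sh."""
    return command[:1] == ["/bin/sh"]


# Hypothetical, shortened sample modeled on the manifest fragment above.
SAMPLE = """\
    image: etcd:v3.5.21_vmware.1-fips
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379
          get --consistency="s" foo
      failureThreshold: 8
"""
```

On a Supervisor control plane VM, the text passed to `probe_command` would be the contents of /etc/kubernetes/manifests/etcd.yaml. If `probe_uses_shell` returns True and the etcd image is 3.5.7 or later, the manifest matches the cause described here.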

Resolution

Please contact Global Support if you encounter this issue.

Additional Information

Note that the change applied to resolve this issue is reverted whenever a Supervisor control plane VM is recreated, for example during a Supervisor cluster upgrade, and may need to be re-applied.
