Liveness probe failed on etcd pods after Supervisor upgrade to v1.30.10 impacting kube-api

Article ID: 411199

Products

VMware vSphere Kubernetes Service

Issue/Introduction

The Supervisor cluster Kubernetes API switches between available and unavailable after a Supervisor upgrade to "v1.30.10+vmware.1-fips-vsc0.1.12-24799161" because the liveness probe fails on the etcd pods. Events for the etcd pods show:

LAST SEEN                TYPE      REASON      OBJECT                                      MESSAGE
2m24s (x7942 over 22h)   Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
62s (x7792 over 21h)     Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
7s (x7882 over 21h)      Warning   Unhealthy   Pod/etcd-##############################   (combined from similar events): Liveness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "##############################": OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
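
To confirm that a set of events carries this specific signature rather than some other probe failure, the event text can be filtered for the missing /bin/sh message. The following is a minimal sketch, not an official diagnostic; the function name count_sh_probe_failures is hypothetical, and it assumes the event lines are supplied on standard input in the format shown above.

```shell
# Hypothetical helper (not part of any VMware tooling): reads event
# lines on stdin and prints how many of them report a liveness probe
# that errored because /bin/sh is missing from the container.
count_sh_probe_failures() {
  # grep -c prints the number of matching lines; "|| true" keeps the
  # exit status at 0 when there are no matches.
  grep -c 'Liveness probe errored.*stat /bin/sh: no such file or directory' || true
}
```

While the API is reachable, it could be fed from kubectl, for example `kubectl get events -A | count_sh_probe_failures`. A non-zero count suggests the events match this article.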

The vCenter UI shows errors like the following under Workload Management >> Supervisor >> Configuring:

Installed and Started Kubernetes Node Agent on the ESXi Host
A general system error occurred. Error message: Get "http://localhost:1080/external-cert/http1/#.#.#.#:6443/api/v1/nodes?fieldSelector=metadata.name%<esxi-hostname>.com": context deadline exceeded (Client.Timeout exceeded while awaiting headers).

kubectl commands fail with API errors such as:

#kubectl get nodes 

E1202 398062 1484259 memcache.go:265) couldn't get current server api group list:  "https://127.0.0.1:6443/api?timeout=3s": dial tcp 127.0.0.1:6443:  Connection refused

Environment

vCenter: 8.0U3

Supervisor: v1.30.10+vmware.1-fips-vsc0.1.12-24799161

Cause

The liveness probe fails on the etcd pods because the livenessProbe command in the etcd manifest, /etc/kubernetes/manifests/etcd.yaml on each Supervisor control plane VM, invokes /bin/sh:

    env:
    - name: ETCD_ENABLE_V2
      value: "true"
    image: etcd:v3.5.21_vmware.1-fips
    imagePullPolicy: Never
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -ec
        - ETCDCTL_API=3 etcdctl --endpoints=https://[127.0.0.1]:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt
          --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
          get --consistency="s" foo
      failureThreshold: 8
      initialDelaySeconds: 15
      timeoutSeconds: 15

 

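The /bin/sh binary was removed from the etcd container image in etcd 3.5.7, so a livenessProbe that executes /bin/sh cannot run. The exec fails with the "stat /bin/sh: no such file or directory" error shown in the pod events.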
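
As a read-only check, the manifest can be inspected for a probe command that starts with /bin/sh. This is a minimal sketch under stated assumptions: the function name probe_uses_sh is hypothetical, it relies on grep supporting the -A option, and it assumes the manifest uses the same layout as the excerpt above, with /bin/sh listed within three lines of the "command:" key. It does not modify the manifest.

```shell
# Hypothetical helper (not part of any VMware tooling): exits 0 when
# the manifest at $1 lists /bin/sh as an entry of a "command:" list,
# and non-zero otherwise.
probe_uses_sh() {
  # -A3 prints the three lines after each bare "command:" key, which is
  # where the first entries of an exec command list appear.
  grep -A3 -E '^[[:space:]]*command:[[:space:]]*$' "$1" \
    | grep -Eq '^[[:space:]]*-[[:space:]]+/bin/sh[[:space:]]*$'
}
```

On a Supervisor control plane VM it might be used as `probe_uses_sh /etc/kubernetes/manifests/etcd.yaml && echo "livenessProbe invokes /bin/sh"`.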

Resolution

Contact Global Support if you encounter this issue.

Additional Information

Note that the change applied to resolve this issue is reverted when the Supervisor control plane VMs are recreated, for example during a Supervisor cluster upgrade, and may need to be re-applied.