Supervisor Cluster Unhealthy with etcd and kube-apiserver containers failing to start with error message "etcdmain: tls: private key does not match public key"

Article ID: 375502

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

The Supervisor Cluster is unhealthy and down. The vCenter UI reports the following error message under the Workload Management inventory: "System error occurred on Master Node with identifier returned non-zero exit status 1.."

IMPACT:

  • The Supervisor Cluster is inaccessible through the floating IP (FIP) because no leader has been elected among the Supervisor Control Plane VMs.
  • "kubectl" commands cannot be run within the cluster, and containers such as "etcd", "kube-apiserver" and "kube-scheduler" are crashing.
  • New Pods/VMs cannot be deployed within the Tanzu Guest Clusters.
  • The certificates of the Supervisor Cluster might be expired.
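
To check whether a certificate has actually expired, a minimal sketch along the following lines may help. The helper name cert_expires_within is hypothetical and is not part of any VMware tooling. The path in the usage comment is the etcd PKI directory that appears in the logs in this article, and other certificate locations would need to be checked separately.

```shell
#!/bin/sh
# Hypothetical helper (not part of any VMware tooling).
# Returns 0 if the certificate expires within the given number of seconds
# (0 seconds means "already expired"), 1 if it is still valid at that point,
# and 2 if the file cannot be read.
cert_expires_within() {
    cert="$1"
    seconds="${2:-0}"
    [ -r "$cert" ] || return 2
    # 'openssl x509 -checkend N' exits 0 when the certificate is still valid
    # N seconds from now, and non-zero otherwise.
    if openssl x509 -checkend "$seconds" -noout -in "$cert" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Example usage on a Control Plane node:
#   for c in /etc/kubernetes/pki/etcd/*.crt; do
#       cert_expires_within "$c" 0 && echo "EXPIRED: $c"
#       openssl x509 -enddate -noout -in "$c"
#   done
```

An expired certificate and a key/certificate mismatch are separate conditions, so a certificate that is still valid does not rule out the error described in this article.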

Logs from the exited/crashing containers:

crictl logs <Container_ID_Exited_etcd> : 

YYYY-MM-DD HH:MM:SS I | pkg/flags: recognized and used environment variable ETCD_ENABLE_V2=true
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | etcdmain: etcd Version: 3.4.13
YYYY-MM-DD HH:MM:SS I | etcdmain: Git SHA: GitNotFound
YYYY-MM-DD HH:MM:SS I | etcdmain: Go Version: go1.15.2
YYYY-MM-DD HH:MM:SS I | etcdmain: Go OS/Arch: linux/amd64
YYYY-MM-DD HH:MM:SS I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
YYYY-MM-DD HH:MM:SS N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
YYYY-MM-DD HH:MM:SS C | etcdmain: tls: private key does not match public key


crictl logs <Container_ID_Exited_kube-apiserver> : 

Flag --experimental-encryption-provider-config has been deprecated, use --encryption-provider-config.
Flag --kubelet-https has been deprecated, API Server connections to kubelets always use https. This flag will be removed in 1.22.
IMMDD HH:MM:SS       1 server.go:629] external host was not specified
I0822 HH:MM:SS       1 server.go:181] Version: v1.21.0+vmware.wcp.2
Error: tls: private key does not match public key
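
The container IDs used in the crictl logs commands above can be looked up with crictl ps. The following is a minimal sketch, assuming crictl is on the PATH of the Control Plane node. The function name show_exited_logs and the CRICTL override variable are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the last log lines of every exited container
# whose name matches the first argument.
# Set CRICTL to point at a different crictl binary if needed.
show_exited_logs() {
    name="$1"
    crictl_bin="${CRICTL:-crictl}"
    # -a includes stopped containers, --state filters on container state,
    # -q prints container IDs only.
    for id in $("$crictl_bin" ps -a --state exited --name "$name" -q); do
        echo "=== $name container $id ==="
        "$crictl_bin" logs --tail 20 "$id" 2>&1
    done
}

# Example usage on a Control Plane node:
#   show_exited_logs etcd
#   show_exited_logs kube-apiserver
```

A node affected by this issue would be expected to show the "tls: private key does not match public key" line near the end of the output.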



WCP logs on vCenter Server, /var/log/vmware/wcp/wcpsvc.log: 

YYYY-MM-DDTHH:MM:SS debug wcp [kubelifecycle/kube_instance.go:5515] [opID=68c130d8-672ee620-db30-40af-987a-c7d025bcc8f7] Cluster is not ready yet, would retry in 1m0s time.
YYYY-MM-DDTHH:MM:SS error wcp [vclib/guestop.go:338] [opID=68ca282d-672ee620-db30-40af-987a-c7d025bcc8f7-reconcile] Kubenode guest command failed. RC: 12 8, Out: , Err: Error: tls: private key does not match public key

YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR
YYYY-MM-DDTHH:MM:SS debug wcp [notifications/notifications.go:244] [opID=66cd08d8] No notifications. seqNum: 72, Current seqNum: 71
YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR 

Environment

VMware vSphere with Tanzu

Cause

This issue occurs because the etcd server key and server certificate do not match on all three Control Plane nodes, or on two of them. As a result, etcd quorum cannot be maintained and the containers fail to start.

Resolution

  1. Open an SSH session to each of the three Control Plane nodes and determine which nodes are unhealthy, with the 'etcd' and 'kube-apiserver' containers crashing.

  2. Change the directory to "/etc/kubernetes/pki/etcd"

  3. Compare the modulus of the certificate and the key and check whether both commands return the same string. The modulus must match between the certificate and the key; otherwise the TLS/SSL handshake fails:

    1. openssl rsa -modulus -noout -in server.key

    2. openssl x509 -modulus -noout -in server.crt

  4. If the moduli do not match, regenerate the certificates using the steps from the KB "Replace vSphere with Tanzu Supervisor Certificates".
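
The comparison in step 3 can be scripted so that it is repeatable on each node. The sketch below assumes RSA keys, as the openssl rsa command in step 3 implies. The function names modulus_matches and check_etcd_pairs are hypothetical. Checking the peer pair in addition to the server pair is an assumption based on the peer.crt/peer.key paths shown in the etcd log above, not a step from the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: compare the RSA modulus of a certificate and a key.
# Returns 0 when they match, 1 when they differ, 2 when a file cannot be parsed.
modulus_matches() {
    cert="$1"
    key="$2"
    cert_mod=$(openssl x509 -modulus -noout -in "$cert" 2>/dev/null) || return 2
    key_mod=$(openssl rsa -modulus -noout -in "$key" 2>/dev/null) || return 2
    [ "$cert_mod" = "$key_mod" ]
}

# Hypothetical helper: check the server and peer pairs in the etcd PKI directory.
check_etcd_pairs() {
    dir="${1:-/etc/kubernetes/pki/etcd}"
    for name in server peer; do
        if modulus_matches "$dir/$name.crt" "$dir/$name.key"; then
            echo "OK:       $name.crt matches $name.key"
        else
            echo "MISMATCH: $name.crt / $name.key (or unreadable)"
        fi
    done
}

# Example usage on each Control Plane node:
#   check_etcd_pairs
```

Any pair reported as a mismatch indicates a node where the certificates need to be regenerated as described in step 4.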