Supervisor Cluster Unhealthy with etcd and kube-apiserver containers failing to start with error message "etcdmain: tls: private key does not match public key"

Article ID: 375502


Products

VMware vSphere Kubernetes Service

Issue/Introduction

The Supervisor Cluster is unhealthy and down, with the following error message reported in the vCenter UI under the Workload Management inventory: "System error occurred on Master Node with identifier returned non-zero exit status 1.."

IMPACT:

  • The Supervisor Cluster is inaccessible via the floating IP (FIP) because no leader is elected among the Supervisor Control Plane VMs.
  • "kubectl" commands cannot be run within the cluster, and containers such as "etcd", "kube-apiserver", and "kube-scheduler" are crashing.
  • New Pods/VMs cannot be deployed within the Tanzu Guest Clusters.
  • The Supervisor Cluster's certificates might be expired.

Reviewing the logs of the exited/crashing containers:

crictl logs <Container_ID_Exited_etcd> : 

YYYY-MM-DD HH:MM:SS I | pkg/flags: recognized and used environment variable ETCD_ENABLE_V2=true
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | etcdmain: etcd Version: 3.4.13
YYYY-MM-DD HH:MM:SS I | etcdmain: Git SHA: GitNotFound
YYYY-MM-DD HH:MM:SS I | etcdmain: Go Version: go1.15.2
YYYY-MM-DD HH:MM:SS I | etcdmain: Go OS/Arch: linux/amd64
YYYY-MM-DD HH:MM:SS I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
YYYY-MM-DD HH:MM:SS N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
YYYY-MM-DD HH:MM:SS C | etcdmain: tls: private key does not match public key


crictl logs <Container_ID_Exited_kube-apiserver> : 

Flag --experimental-encryption-provider-config has been deprecated, use --encryption-provider-config.
Flag --kubelet-https has been deprecated, API Server connections to kubelets always use https. This flag will be removed in 1.22.
IMMDD HH:MM:SS       1 server.go:629] external host was not specified
IMMDD HH:MM:SS       1 server.go:181] Version: v1.21.0+vmware.wcp.2
Error: tls: private key does not match public key



WCP logs on vCenter Server, /var/log/vmware/wcp/wcpsvc.log: 

YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR
YYYY-MM-DDTHH:MM:SS debug wcp [notifications/notifications.go:244] [opID=66cd08d8] No notifications. seqNum: 72, Current seqNum: 71
YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR 

Environment

VMware vSphere with Tanzu 

Cause

This issue occurs because the server key and server certificate for etcd do not match on all of the Control Plane nodes, or on two of them. As a result, etcd quorum is not maintained and the containers fail to start.

Resolution

  1. Open an SSH session to each of the three Control Plane nodes and determine which nodes are unhealthy, with 'etcd' and 'kube-apiserver' crashing.

  2. Change the directory to "/etc/kubernetes/pki/etcd".

  3. Compare the modulus of the certificate and of the key and determine whether the same string is returned (the moduli must match between the certificate and the key, or the TLS/SSL handshake will fail):

    1. openssl rsa -modulus -noout -in server.key

    2. openssl x509 -modulus -noout -in server.crt

  4. If the moduli do not match, the certificates must be regenerated manually on the impacted nodes, as the "certmgr" utility might fail.
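The comparison in step 3 can be scripted. The following is a minimal sketch that hashes both moduli and reports whether the pair is consistent; the 'check_pair' helper name is illustrative and not part of any VMware tooling.

```shell
# check_pair CERT KEY — hash the modulus of each file and report whether
# the certificate and the private key belong together.
check_pair() {
    cert_mod=$(openssl x509 -modulus -noout -in "$1" | openssl md5)
    key_mod=$(openssl rsa -modulus -noout -in "$2" | openssl md5)
    if [ "$cert_mod" = "$key_mod" ]; then
        echo "MATCH"
    else
        echo "MISMATCH"
    fi
}

# On an impacted Control Plane node:
# check_pair /etc/kubernetes/pki/etcd/server.crt /etc/kubernetes/pki/etcd/server.key
```

Hashing the modulus output avoids comparing the long hex strings by eye; "MISMATCH" on any node means that node's certificates need to be regenerated.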


Regenerate the Certificates manually on the impacted nodes

1. Make a backup of all the existing certificates:

# mkdir -p /root/backup_manual
# cp -rfp /etc/kubernetes/* /root/backup_manual
# cp -rfp /dev/shm/wcp_decrypted_data/* /root/backup_manual
# cp -rfp /etc/vmware/wcp/tls/* /root/backup_manual
# cp -rfp /var/lib/kubelet/pki/* /root/backup_manual
# cp -rfp /etc/ssl/certs/* /root/backup_manual
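As a sketch, the copies above can be wrapped in a small helper that skips any path missing on a given node instead of aborting partway through; the 'backup_paths' name is illustrative.

```shell
# backup_paths DEST SRC... — copy each existing SRC into DEST, preserving
# ownership and permissions; report missing paths instead of failing.
backup_paths() {
    dest="$1"; shift
    mkdir -p "$dest"
    for src in "$@"; do
        if [ -e "$src" ]; then
            cp -rfp "$src" "$dest/"
        else
            echo "skipped missing path: $src"
        fi
    done
}

# On a Control Plane node:
# backup_paths /root/backup_manual /etc/kubernetes /dev/shm/wcp_decrypted_data \
#     /etc/vmware/wcp/tls /var/lib/kubelet/pki /etc/ssl/certs
```

Note this copies each directory itself into the destination rather than its contents, so the backups stay separated per source path.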

2. Run the command below to regenerate all the certificates manually with 'kubeadm':

  • kubeadm alpha certs renew all

  • Note: On newer Kubernetes versions the 'alpha' subcommand has been removed, so renew with: kubeadm certs renew all

3. Wait around 5 minutes and check that the 'etcd' container starts and is no longer crashing with the log entries shown above.

  • watch crictl ps 
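Instead of watching manually, the check can be polled. This sketch takes the listing command as a parameter (on a Control Plane node it would be "crictl ps"); the 'wait_for_container' helper name is illustrative.

```shell
# wait_for_container LIST_CMD NAME [TRIES] [INTERVAL] — run LIST_CMD
# repeatedly until its output mentions NAME, or give up after TRIES
# attempts spaced INTERVAL seconds apart.
wait_for_container() {
    list_cmd="$1"; name="$2"; tries="${3:-30}"; interval="${4:-10}"
    i=0
    while [ "$i" -lt "$tries" ]; do
        if $list_cmd 2>/dev/null | grep -q "$name"; then
            echo "running: $name"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "timed out waiting for: $name"
    return 1
}

# On a Control Plane node:
# wait_for_container "crictl ps" etcd
```

A non-zero return on timeout makes the helper usable in scripts that should stop when etcd never stabilizes.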

4. Perform the same steps on all impacted nodes where 'etcd' is crashing, so that quorum is maintained and a leader can be elected.

Confirm that the containers are now running, the nodes report Ready status, and 'kubectl' commands work.