The Supervisor Cluster is unhealthy and down. The vCenter UI reports the following error under the Workload Management inventory: "System error occurred on Master Node with identifier returned non-zero exit status 1.."
IMPACT :
Logs from the exited/crashing containers:
crictl logs <Container_ID_Exited_etcd> :
YYYY-MM-DD HH:MM:SS I | pkg/flags: recognized and used environment variable ETCD_ENABLE_V2=true
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | etcdmain: etcd Version: 3.4.13
YYYY-MM-DD HH:MM:SS I | etcdmain: Git SHA: GitNotFound
YYYY-MM-DD HH:MM:SS I | etcdmain: Go Version: go1.15.2
YYYY-MM-DD HH:MM:SS I | etcdmain: Go OS/Arch: linux/amd64
YYYY-MM-DD HH:MM:SS I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
YYYY-MM-DD HH:MM:SS N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
YYYY-MM-DD HH:MM:SS I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
YYYY-MM-DD HH:MM:SS C | etcdmain: tls: private key does not match public key
crictl logs <Container_ID_Exited_kube-apiserver> :
Flag --experimental-encryption-provider-config has been deprecated, use --encryption-provider-config.
Flag --kubelet-https has been deprecated, API Server connections to kubelets always use https. This flag will be removed in 1.22.
IMMDD HH:MM:SS 1 server.go:629] external host was not specified
I0822 HH:MM:SS 1 server.go:181] Version: v1.21.0+vmware.wcp.2
Error: tls: private key does not match public key
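The log check above can be sketched as a small script that lists every container matching a name and reports the ones whose logs contain the TLS error. This is a minimal sketch, not part of the original procedure: the function name `containers_with_tls_error` is hypothetical, and it assumes `crictl` is on the PATH of the Control Plane node and is run as root.

```shell
#!/bin/sh
# Error string taken from the etcd and kube-apiserver logs shown above.
TLS_ERR='private key does not match public key'

# Hypothetical helper: print the ID of every container (in any state) whose
# name matches $1 and whose logs contain the TLS error string.
containers_with_tls_error() {
    for id in $(crictl ps -a --name "$1" -q 2>/dev/null); do
        if crictl logs "$id" 2>&1 | grep -q "$TLS_ERR"; then
            echo "$id"
        fi
    done
}

# Only run the scan where crictl is actually installed.
if command -v crictl >/dev/null 2>&1; then
    for name in etcd kube-apiserver; do
        echo "== $name =="
        containers_with_tls_error "$name"
    done
fi
```

An empty list for a container name means that none of its instances logged this particular error.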
WCP logs on vCenter Server, /var/log/vmware/wcp/wcpsvc.log :
YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR
YYYY-MM-DDTHH:MM:SS debug wcp [notifications/notifications.go:244] [opID=66cd08d8] No notifications. seqNum: 72, Current seqNum: 71
YYYY-MM-DDTHH:MM:SS error wcp [licensemonitor/license_event_monitor.go:259] [opID=licenseRefreshMonitor] Supervisor control plane failed: No connectivity to API Master: connectivity Get "https://<Control_Plane_IP_Address>:6443/healthz?timeout=5s": dial tcp <Control_Plane_IP_Address>:6443: connect: no route to host, config status ERROR
VMware vSphere with Tanzu
This issue occurs because the server key and the server certificate for etcd do not match on all of the Control Plane nodes, or on any two of them. As a result, etcd quorum cannot be maintained and the containers fail to start.
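One way to confirm a key/certificate mismatch on a node is to compare the public key embedded in the certificate with the public key derived from the private key. The following is a minimal sketch under stated assumptions: the function name `check_pair` and the `PKI_DIR` variable are hypothetical, the `peer.crt`/`peer.key` paths come from the etcd log above, and the `server.crt`/`server.key` names are assumed to follow the same directory layout.

```shell
#!/bin/sh
# Hypothetical helper: compare the public key in certificate $1 with the
# public key derived from private key $2. Prints MATCH or MISMATCH.
check_pair() {
    cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || cert_pub=""
    key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || key_pub=""
    # An unreadable file yields an empty string and is reported as a mismatch.
    if [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]; then
        echo "MATCH: $1"
        return 0
    fi
    echo "MISMATCH: $1 <-> $2"
    return 1
}

# Assumed location of the etcd certificates (taken from the peerTLS log line).
PKI_DIR="${PKI_DIR:-/etc/kubernetes/pki/etcd}"
rc=0
for name in server peer; do
    if [ -f "$PKI_DIR/$name.crt" ] && [ -f "$PKI_DIR/$name.key" ]; then
        check_pair "$PKI_DIR/$name.crt" "$PKI_DIR/$name.key" || rc=1
    fi
done
```

After the loop, `rc` is 1 if any checked pair did not match. Running the check on each Control Plane node shows which nodes are impacted.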
Regenerate the certificates manually on the impacted nodes:
1. Make a backup of all the existing certificates:
# mkdir -p /root/backup_manual
# cp -rfp /etc/kubernetes/* /root/backup_manual
# cp -rfp /dev/shm/wcp_decrypted_data/* /root/backup_manual
# cp -rfp /etc/vmware/wcp/tls/* /root/backup_manual
# cp -rfp /var/lib/kubelet/pki/* /root/backup_manual
# cp -rfp /etc/ssl/certs/* /root/backup_manual
2. Run the below command to regenerate all the certificates manually with 'kubeadm' :
3. Wait about 5 minutes, then check that the 'etcd' container starts and no longer crashes with the log entries shown above.
4. Repeat the same steps on every impacted node where 'etcd' is crashing, so that quorum is maintained and a leader can be elected.
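As a variant of the backup in step 1, each source directory can be copied under its original path inside the backup directory, so that files with the same name from different directories cannot overwrite one another. This is a sketch of an alternative layout, not the procedure given above; the function name `backup_dirs` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy each source directory into DEST under its original
# absolute path, for example /etc/kubernetes -> DEST/etc/kubernetes.
backup_dirs() {
    dest="$1"
    shift
    for src in "$@"; do
        if [ -d "$src" ]; then
            mkdir -p "$dest$src"
            # Same cp flags as step 1; "src/." also copies hidden files.
            cp -rfp "$src/." "$dest$src/"
        else
            echo "skipped (not found): $src" >&2
        fi
    done
}

# Example call with the directories from step 1 (run as root on the node):
# backup_dirs /root/backup_manual /etc/kubernetes /dev/shm/wcp_decrypted_data \
#     /etc/vmware/wcp/tls /var/lib/kubelet/pki /etc/ssl/certs
```

With this layout, restoring a single directory means copying the matching subtree back to its original path.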
Confirm that the containers are now running, the nodes report Ready status, and the 'kubectl' command works.
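The final confirmation can be sketched as a short script. It is a minimal sketch, assuming `crictl` and `kubectl` are available on the Control Plane node and that `kubectl` is already configured there; the function name `container_running` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if crictl lists at least one running container
# whose name matches $1 (crictl ps shows only running containers by default).
container_running() {
    [ -n "$(crictl ps --name "$1" -q 2>/dev/null)" ]
}

if command -v crictl >/dev/null 2>&1; then
    for name in etcd kube-apiserver; do
        if container_running "$name"; then
            echo "$name: running"
        else
            echo "$name: NOT running"
        fi
    done
fi

# Node status; every Control Plane node is expected to report Ready.
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes || echo "kubectl could not reach the API server"
fi
```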