Certs not renewed for 2 of the control plane nodes.

Article ID: 399730

Products

VMware Telco Cloud Automation

VMware Tanzu Kubernetes Grid Management

Issue/Introduction

Only one of the three control plane (CP) nodes had its certificates renewed. The other two CP nodes still have the old certificates.

The KubeadmControlPlane (KCP) status reports EtcdClusterUnhealthy with the message "Following machine is reporting etcd member errors: cl-cp-1-faulty", but the machine/node cl-cp-1-faulty does not exist in vCenter or in the machine list.
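
One way to inspect the conditions described above is to print them straight from the KCP object. This is a minimal sketch: the helper name kcp_conditions is hypothetical, and the namespace and KCP name are placeholders you must supply for your own cluster.

```shell
# Sketch only: print each KCP condition as "type<TAB>status<TAB>reason<TAB>message".
# kcp_conditions is a hypothetical helper name, not part of any product.
# Usage: kcp_conditions <namespace> <kcp-name>
kcp_conditions() {
  kubectl get kubeadmcontrolplane "$2" -n "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\t"}{.message}{"\n"}{end}'
}
```

If you do not know the KCP name, `kubectl get kubeadmcontrolplane -A` lists the KCP objects in all namespaces.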

 

Environment

TKGm 2.1.1

Management cluster version is 1.24.10

TCA 3.2

Cause

The KCP reports status.replicas=4 while spec.replicas=3.

This happened during the rolling update in which KCP rotates the certificates on each control plane node. The first control plane machine was successfully recreated with a newly issued leaf certificate, but creation of the second machine failed because of an etcd error. KCP appears to have scaled down when this error occurred: the machine was deleted, but KCP treated the scale-down as failed. This stale state blocks the rest of the rolling update.
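
The replica mismatch described above can be confirmed by reading both fields from the KCP object. The following is a rough sketch, not an official procedure: check_kcp_replicas is a hypothetical helper, and the namespace and KCP name are placeholders.

```shell
# Sketch only: compare desired vs. observed replica counts of a KubeadmControlPlane.
# check_kcp_replicas is a hypothetical helper name.
# Usage: check_kcp_replicas <namespace> <kcp-name>
check_kcp_replicas() {
  ns="$1"
  name="$2"
  # Desired replica count from the spec
  spec=$(kubectl get kubeadmcontrolplane "$name" -n "$ns" -o jsonpath='{.spec.replicas}')
  # Replica count that KCP currently believes exists
  status=$(kubectl get kubeadmcontrolplane "$name" -n "$ns" -o jsonpath='{.status.replicas}')
  if [ "$spec" != "$status" ]; then
    echo "MISMATCH: spec.replicas=$spec status.replicas=$status"
    return 1
  fi
  echo "OK: replicas=$spec"
}
```

In the situation covered by this article, the helper would be expected to report a mismatch (3 desired, 4 observed) until KCP has been restarted and its status refreshed.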

Resolution

Workaround:

  1. Make sure that the machine cl-cp-1-faulty has been deleted and no longer appears in the machine list.
    kubectl get machines -A

     

  2. Make sure that the etcd members are healthy and that there are no stale (zombie) members.
    ETCDCTL_API=3 etcdctl member list --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key

     

  3. Restart the KCP controller so that the KCP status is refreshed.
    kubectl rollout restart deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system

     

  4. After KCP recovers, the control plane rolls out an update, because the etcd parameter has changed and the certificate expiration time is less than 90 days away.
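
The zombie-member check in step 2 can be partly automated by comparing etcd member names with the current node names. The sketch below rests on two assumptions: etcdctl prints its default comma-separated output with the member name in the third field, and etcd member names match Kubernetes node names. The helper name stale_etcd_members is hypothetical.

```shell
# Sketch only: print etcd members whose name has no matching Kubernetes node.
# Assumes the default "simple" output of "etcdctl member list"
# (ID, status, name, peer URLs, client URLs, ...) and that member names equal node names.
# stale_etcd_members is a hypothetical helper name.
# Usage: etcdctl member list ... | stale_etcd_members "<newline-separated node names>"
stale_etcd_members() {
  nodes="$1"
  while IFS=, read -r _id _status name _rest; do
    # Fields are separated by ", " so strip the padding around the name
    name=$(printf '%s' "$name" | tr -d ' ')
    [ -z "$name" ] && continue
    # Report the member if no node has exactly this name
    if ! printf '%s\n' "$nodes" | grep -qxF -- "$name"; then
      echo "$name"
    fi
  done
}
```

The node names can be collected with `kubectl get nodes -o name | sed 's|^node/||'` and passed as the argument. Empty output means every etcd member maps to an existing node. After step 3, `kubectl rollout status deployment capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system` can be used to wait for the restarted controller to become ready.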