VM class change in the guest cluster YAML does not provision new VMs with the new VM class.



Article ID: 378998


Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • The guest cluster's VM class was updated in the cluster YAML, but new VMs using the updated VM class failed to provision. 

  • A kubectl describe of the KubeadmControlPlane (KCP) object for the guest cluster shows:

Message: Following machines are reporting unknown etcd member status: <cluster-name-control-plane-A>,<cluster-name-control-plane-B>,<cluster-name-control-plane-C>

  • Log snippets from /var/log/pods/vmware-system-capw_kubeadm-control-plane-controller-manager/manager/x.log on the Supervisor contain entries such as:

YYYY-MM-DDTHH:MM:SS. stderr F E0308 12:36:45.0070611 controller.go:326] "Reconciler error" err="failed to get etcdStatus for workload cluster <cluster-name>: failed to create etcd client: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="<namespace/cluster-name>" namespace="<namespace-name>" name="<cluster-name>" reconcileID=<ID>

YYYY-MM-DDTHH:MM:SS. stderr F I0116 22:15:01.5345821 remediation.go:286] "etcd cluster projected after remediation of cluster-CP-VM-name" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" kubeadmControlPlane="namespace-name/cluster-name" namespace="namespace-name" name="cluster-name" reconcileID=...

  • A curl over port 6443 to the guest cluster's control plane service returns 'ok', which means the control plane nodes are able to communicate with each other.

Environment

vSphere with Tanzu

Cause

  • Older versions of CAPI have difficulty reconciling unresponsive guest clusters.

  • If any of the following controllers on the Supervisor are in an error state (Terminating or Pending), guest cluster reconciliation gets queued:

    • capi-controller-manager

    • capi-kubeadm-control-plane-controller-manager

    • capi-kubeadm-bootstrap-controller-manager
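The controller checks above can be sketched as a small shell helper that flags any CAPI controller pod that is not Running. The function name is illustrative; the namespace matches the vmware-system-capw namespace used elsewhere in this article.

```shell
# Sketch: list the CAPI controller pods on the Supervisor and flag any
# that are not in the Running state. Helper name is illustrative.
check_capi_controllers() {
  kubectl get pods -n vmware-system-capw --no-headers \
    | grep -i capi \
    | while read -r name ready status rest; do
        # STATUS is the third column of `kubectl get pods` output
        if [ "$status" != "Running" ]; then
          echo "NOT HEALTHY: $name ($status)"
        else
          echo "OK: $name"
        fi
      done
}
```

Any pod reported as NOT HEALTHY (for example Terminating or Pending) indicates the condition described above, where guest cluster reconciliation gets queued.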

Resolution

  1. Verify the Guest cluster etcd is healthy.

    • Example:


      root@************** [ ~ ]# etcdctl --cluster=true endpoint health -w table
      +--------------------------+--------+------------+-------+
      |         ENDPOINT         | HEALTH |    TOOK    | ERROR |
      +--------------------------+--------+------------+-------+
      | https://**************:2379 |   true | 4.671596ms |       |
      | https://**************:2379 |   true | 7.120376ms |       |
      | https://**************:2379 |   true | 7.356998ms |       |
      +--------------------------+--------+------------+-------+


      root@************** [ ~ ]# etcdctl member list -w table
      +------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
      |        ID        | STATUS  |               NAME               |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
      +------------------+---------+----------------------------------+--------------------------+--------------------------+------------+
      | ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
      | ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
      | ************** | started | ************** | https://**************:2380 | https://**************:2379 |      false |
      +------------------+---------+----------------------------------+--------------------------+--------------------------+------------+


      root@************** [ ~ ]# etcdctl --cluster=true endpoint status -w table
      +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |         ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
      | https://**************:2379 | ************** |  3.5.11 |  176 MB |      true |      false |        83 |  584410086 |          584410086 |        |
      | https://**************:2379 | ************** |  3.5.11 |  176 MB |     false |      false |        83 |  584410086 |          584410086 |        |
      +--------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
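Depending on the node image, etcdctl may need endpoint and certificate flags before the commands above succeed. The wrapper below is a sketch; the certificate paths are assumptions based on the standard kubeadm layout under /etc/kubernetes/pki/etcd and should be verified on your control plane nodes.

```shell
# Sketch: set the endpoint and TLS flags etcdctl typically needs on a
# kubeadm-style control plane node, then run the cluster health check.
# Certificate paths are assumed from the standard kubeadm layout.
etcd_health() {
  ETCDCTL_API=3 \
  ETCDCTL_ENDPOINTS=https://127.0.0.1:2379 \
  ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt \
  ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt \
  ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key \
  etcdctl --cluster=true endpoint health -w table
}
```

The same environment variables apply to the member list and endpoint status commands shown above.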

  2. Get the status of the guest cluster's (TKC's) API server and test its connectivity to the CAPI controllers.

    1. Get the svc corresponding to the namespace:

      kubectl get svc -n <namespace>

      Example:
      [ ~ ]# kubectl get svc -n namespace01
      NAME                               TYPE           CLUSTER-IP   EXTERNAL-IP     PORT(S)    AGE
      tkc-xsmall-control-plane-service   LoadBalancer   ##.##.#.##   ###.###.#.###   6443/TCP   44d

    2. Check the health of the Cluster External IP:

      curl -k https://<EXTERNAL-IP obtained from above step>:6443/healthz

      e.g. output:
      root@<Supervisor CP node name>[ ~ ]# curl -k https://<EXTERNAL-IP>:6443/healthz
      ok
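Steps 2.1 and 2.2 can be combined in one helper that looks up the service's external IP and probes /healthz. The function name is illustrative; pass your own namespace and control plane service name.

```shell
# Sketch: fetch the guest cluster control plane service's external IP
# and probe the API server's /healthz endpoint. Helper name and
# arguments are illustrative.
check_guest_apiserver() {
  ns="$1"; svc="$2"
  # External IP of a LoadBalancer service lives under status.loadBalancer
  ip="$(kubectl get svc -n "$ns" "$svc" \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"
  curl -sk "https://${ip}:6443/healthz"
}
```

A healthy API server returns "ok", matching the example output above.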

  3. Restart the CAPI controllers by scaling their deployments down to zero and back up.

Note: Before initiating a restart, ensure that the status of all controllers is verified. If any controller is in an error state, it is essential to collect the WCP logs before proceeding with the restart.

kubectl get pods -A | grep -i capi

    1. Check the current replica count of each controller deployment so it can be scaled back to the same number (typically 2 or 3, depending on the customer environment):

      kubectl get deployment -n vmware-system-capw | grep capi
    2. Scale each controller to 0 and then back to its original replica count:

kubectl scale deployment -n vmware-system-capw --replicas=0 capi-controller-manager
kubectl scale deployment -n vmware-system-capw --replicas=2 capi-controller-manager

kubectl scale deployment -n vmware-system-capw --replicas=0 capi-kubeadm-bootstrap-controller-manager
kubectl scale deployment -n vmware-system-capw --replicas=2 capi-kubeadm-bootstrap-controller-manager

kubectl scale deployment -n vmware-system-capw --replicas=0 capi-kubeadm-control-plane-controller-manager
kubectl scale deployment -n vmware-system-capw --replicas=2 capi-kubeadm-control-plane-controller-manager


Note: Make sure the guest cluster's etcd and API server/load balancer are healthy before attempting the step above.
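The scale commands above can be wrapped in a sketch that first records each deployment's current replica count, so the environment-specific number (2 or 3) does not have to be hard-coded. The function name is illustrative; run it only after the etcd and API server checks above pass.

```shell
# Sketch: restart each CAPI controller deployment by recording its
# current replica count, scaling to 0, then scaling back to that count.
# Helper name is illustrative.
restart_capi_controllers() {
  for dep in capi-controller-manager \
             capi-kubeadm-bootstrap-controller-manager \
             capi-kubeadm-control-plane-controller-manager; do
    # Capture the current replica count before scaling down
    replicas="$(kubectl get deployment -n vmware-system-capw "$dep" \
                -o jsonpath='{.spec.replicas}')"
    echo "Restarting $dep (replicas: $replicas)"
    kubectl scale deployment -n vmware-system-capw --replicas=0 "$dep"
    kubectl scale deployment -n vmware-system-capw --replicas="$replicas" "$dep"
  done
}
```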