vSphere Workload Cluster Control Plane or vSphere Supervisor Cluster Control Plane Missing ROLES

Article ID: 386786


Products

  • VMware vSphere 7.0 with Tanzu
  • vSphere with Tanzu
  • Tanzu Kubernetes Runtime

Issue/Introduction

In a vSphere workload cluster or Supervisor cluster, one or more control planes are missing the expected roles.

Confirm which context and cluster are affected, then follow the steps in the appropriate section:

  • Supervisor Cluster Control Plane for Supervisor cluster control plane VMs
  • vSphere Workload Cluster Control Plane for guest cluster control plane nodes

 

Supervisor Cluster Control Plane

While connected to the Supervisor cluster context, the following symptoms are present:

  • One or more Supervisor cluster control plane VMs do not have both "control-plane,master" roles or show <none> for ROLES:
    • kubectl get nodes

      NAME                          STATUS    ROLES
      <supervisor-vm-dns-name-1>    Ready     control-plane
      <supervisor-vm-dns-name-2>    Ready     <none>
      <supervisor-vm-dns-name-3>    Ready     master
      • If one or both of the above roles are missing, pod scheduling and cluster management in the environment will be affected.
      • Note: ESXi hosts are expected to only have the agent role.

  • One or more system pods are stuck in Pending state:
    • kubectl get pods -A | grep -v Run

  • There are 2 or fewer replicas of the same Pending pod in Running state on the other, healthy Supervisor control plane VMs:
    • kubectl get pods -o wide -n <Pending pod namespace>

  • Describing the pods stuck in Pending state returns an error message similar to the following, where the pod is waiting for a Ready control plane node with the expected role (a supplementary scheduler-event query is sketched after this list):
    • kubectl describe pod <Pending pod name> -n <Pending pod namespace>
    • X node(s) didn't match pod affinity/selector.
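
As a supplementary way to surface the same scheduling failures without describing each Pending pod individually (an illustrative query, not part of the documented steps; the namespace placeholder is the same as above), the scheduler's events can be filtered by reason:

    kubectl get events -n <Pending pod namespace> --field-selector reason=FailedScheduling

Any FailedScheduling events returned should reference the same affinity/selector mismatch seen in the pod description.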

 

vSphere Workload Cluster Control Plane

While connected to the Supervisor cluster context, one or more of the following symptoms are present for the affected vSphere workload cluster:

  • All control plane machines for the affected cluster show a Running state (a sketch that maps each machine to its node name follows this list):
    • kubectl get machines -n <cluster namespace>

  • Describing the affected cluster's TKC shows that not all control plane nodes are healthy:
    • kubectl describe tkc <cluster name> -n <cluster namespace>

      Message: 0/3 Control Plane Node(s) healthy
  • The kubeadmcontrolplane (kcp) for the affected cluster shows that not all control planes are Available:
    • kubectl get kcp -n <cluster namespace>

  • Describing the kubeadmcontrolplane (kcp) for the affected cluster shows an error message similar to the following, where the name of the control plane(s) will vary by environment:
    • kubectl describe kcp <kcp name> -n <cluster namespace>
    • Message:    Following machines are reporting control plane errors: <my-control-plane-abcde>
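
To correlate the machine names reported by the kubeadmcontrolplane with the node names seen inside the workload cluster, the machines can be listed together with their node references. This is an illustrative sketch only; the custom-columns field paths follow the Cluster API Machine schema used by vSphere with Tanzu and may vary between releases:

    kubectl get machines -n <cluster namespace> -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name'

A machine whose NODE column shows <none> has no node object registered for it, while a machine whose node exists but is missing roles will appear here as Running yet show <none> under ROLES from within the workload cluster context.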

 

While connected to the vSphere workload cluster context, the following symptoms are present:

  • For TKRs v1.20, v1.21, v1.22 and v1.23, one or more cluster control plane nodes do not have both "control-plane,master" roles or show <none> for ROLES:

    kubectl get nodes
    
    NAME                                       STATUS            ROLES
    <guest-cluster-control-plane-vm-a>         Ready             control-plane
    <guest-cluster-control-plane-vm-b>         Ready             master
    <guest-cluster-control-plane-vm-c>         Ready             <none>
  • For TKRs v1.24 and higher, one or more cluster control plane nodes do not have the "control-plane" role, showing <none> for ROLES:

    kubectl get nodes
    
    NAME                                  STATUS         ROLES
    <guest-cluster-control-plane-vm-a>    Ready          <none>
    <guest-cluster-control-plane-vm-b>    Ready          <none>
    <guest-cluster-control-plane-vm-c>    Ready          <none>
  • Each control plane node must have a node-role.kubernetes.io/control-plane label that points to an empty value (empty string):
    kubectl get nodes --show-labels | grep "node-role.kubernetes.io/control-plane=," -o
    node-role.kubernetes.io/control-plane=,
    node-role.kubernetes.io/control-plane=,
    node-role.kubernetes.io/control-plane=,
  • If any of the expected roles are missing, pod scheduling and cluster management in the environment will be affected.
    • Note: Worker nodes are not expected to have roles and will show as <none>.

  • One or more system pods are stuck in Pending state:
    • kubectl get pods -A | grep -v Run

  • There are 2 or fewer replicas of the same Pending pod in Running state on the other, healthy control plane nodes:
    • kubectl get pods -o wide -n <Pending pod namespace>

  • Describing the pods stuck in Pending state returns an error message similar to the following, where the pod is waiting for a Ready control plane node with the expected role (the pod's node selector and tolerations can be inspected as sketched after this list):
    • kubectl describe pod <Pending pod name> -n <Pending pod namespace>
    • X node(s) didn't match pod affinity/selector.
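
To see exactly which node role a Pending pod is selecting on (a supplementary check, not part of the documented steps; the placeholders match those above), the pod's node selector, affinity, and tolerations can be printed directly:

    kubectl get pod <Pending pod name> -n <Pending pod namespace> -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity}{"\n"}{.spec.tolerations}{"\n"}'

A selector or affinity term referencing node-role.kubernetes.io/control-plane (or node-role.kubernetes.io/master) that no Ready node currently carries is consistent with the scheduling error above.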

Environment

vSphere Supervisor
 
This issue can occur on a cluster regardless of whether or not it is managed by Tanzu Mission Control (TMC).

Cause

An issue occurred during creation of the affected control plane(s) in which kubeadm did not assign the expected roles.

Many vSphere Kubernetes and kube-system pods have tolerations and node selection rules that rely on an available, Ready control plane with the expected roles, as described in the Issue/Introduction above.

Several vSphere Kubernetes and kube-system pods are configured with multiple replicas, where one replica is expected to run per control plane.

These system pods will remain in a Pending state until the proper roles are assigned to the control plane(s) that are missing them.
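
A quick way to track which system pods are still waiting (an illustrative alternative to the grep commands shown earlier, not part of the documented steps) is to filter on the pod phase directly:

    kubectl get pods -A --field-selector=status.phase=Pending

Once the expected roles are restored, these pods should no longer appear in the output.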

 

This issue can also occur when a control plane is manually deleted but the system did not finish deleting the node, allowing a restart of kubelet to recreate the control plane node.

IMPORTANT: It is not a proper troubleshooting step to delete nodes. Deleting nodes will often worsen the issue. In particular, deleting control plane nodes can leave a cluster broken, unmanageable and unrecoverable.

Resolution

  1. Confirm which context and cluster are affected by the missing role issue from within the respective context.
    • Is this a Supervisor cluster, or a workload cluster control plane node that has missing roles?
      kubectl get nodes

      • Supervisor cluster node names will show as the DNS name of the Supervisor cluster control plane VM.
      • Workload cluster node names will include the name of the workload cluster but may not contain "control-plane" in the name.
        The KCP object for the cluster will match the name of the control plane nodes.

  2. Note down the Kubernetes version for the affected cluster and context from the above output (a combined version-and-label check is sketched after these steps).
    • If this is a Supervisor cluster, the roles are expected to be control-plane,master.
      • The control-plane label on the node must point to an empty value or empty string. 
      • ESXi hosts are expected to only have the agent role.

    • If this is a workload cluster, the TKR version will be important in identifying which roles are expected for the control plane nodes. 
      • TKRs v1.19 and lower are expected to only have the role "master"
      • TKRs v1.20, v1.21, v1.22 and v1.23 are expected to have roles "control-plane,master"
      • TKRs v1.24 and higher are expected to only have role "control-plane"
      • The control-plane label for a control plane node should point to an empty value or empty string.
      • Worker nodes are not expected to have roles and will show as <none>
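
As a convenience for step 2 (an illustrative sketch only; node names and versions vary by environment), the VERSION column of kubectl get nodes reports the Kubernetes version in use, which for a workload cluster corresponds to the TKR version, and the presence of the expected role labels can be confirmed with the same --show-labels pattern shown earlier:

    kubectl get nodes
    kubectl get nodes --show-labels | grep -e "node-role.kubernetes.io/control-plane=" -e "node-role.kubernetes.io/master="

A control plane node that carries neither label will not appear in the grep output, which corresponds to <none> under ROLES in the first command.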

 

Reach out to VMware by Broadcom Technical Support, referencing this KB article, for assistance with correcting the missing role(s) on the affected control plane(s).

Once the missing roles are added, the system should recognize the control plane(s), and the Pending pods will be able to schedule and move to a Running state on the restored control plane(s).
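
After the roles are restored, the earlier checks can be re-run to confirm recovery (an illustrative verification only; the node name placeholder is hypothetical):

    kubectl get nodes
    kubectl get pods -A -o wide | grep <restored control plane node name>

The restored control plane(s) should show the expected ROLES, and the previously Pending system pods should be listed as Running on them.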