In this scenario, a guest cluster is stuck with its nodes in a SchedulingDisabled state and its machines in a Deleting state. A change made to the cluster has left all of the node pool's machines deleting.
The control plane is healthy and working
Example below:
vcdmachinetemplates.infrastructure.cluster.x-k8s.io   2024-02-14T12:09:21Z

NAME          AGE
<NAMESPACE>   456d
<NAMESPACE>   456d
<NAMESPACE>   615d
NAMESPACE     NAME                                                  CLUSTER     NODENAME      PROVIDERID          PHASE      AGE    VERSION
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Running    615d   v1.24.11+vmware.1
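For reference, output similar to the above can be gathered with commands such as the following, run against the context that hosts the Cluster API objects (the namespace is a placeholder):
# List the CAPVCD machine templates and the Cluster API machines
kubectl get vcdmachinetemplates -n <NAMESPACE>
kubectl get machines -A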
When looking at the pods in the cluster, they are stuck in a Pending state:
argocd                              dex-server-6cd8cfc9d4-v5j                       0/1   Pending   0   111m   <none>   <none>
argocd                              notifications-controller-6d8fc9d4-v9k           0/1   Pending   0   111m   <none>   <none>
argocd                              redis-7dc8b4fbd9-kgqw7                          0/1   Pending   0   111m   <none>   <none>
argocd                              repo-server-848b86fcf6-6l2                      0/1   Pending   0   111m   <none>   <none>
argocd                              server-7b895dc7d9-zfhf7                         0/1   Pending   0   111m   <none>   <none>
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-0bb   1/1   Running   0   27h    <NODEPOOL>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager   1/1   Running   0   27h    <NODEPOOL>
capi-system                         capi-controller-manager-6                       1/1   Running   0   27h    <NODEPOOL>
capvcd-system                       capvcd-controller-manager                       0/1   Pending   0   111m   <none>   <none>
cert-manager                        cert-manager-6d74f84b64-f                       0/1   Pending   0   111m   <none>   <none>
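A quick way to list only the stuck pods is to filter on the pod phase, for example:
kubectl get pods -A -o wide --field-selector=status.phase=Pending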
When describing the machines, we also see that they are repeatedly attempting to drain:
Type     Reason                   Age                      From                            Message
----     ------                   ----                     ----                            -------
Normal   SuccessfulDrainNode      3m7s (x1014 over 26h)    machine-controller              success draining Machine's node "<NODE>..."
Normal   MachineMarkedUnhealthy   2m40s (x5621 over 25h)   machinehealthcheck-controller   Machine <NAMESPACE>/<MACHINE_HEALTH_CHECK>/<MACHINE>
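These events come from describing the affected Machine objects, for example (names are placeholders):
kubectl describe machine -n <NAMESPACE> <CLUSTER>-<NODEPOOL>-<ID>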
TKGM 2.x
Cloud Director
This is caused by a change to the environment in which the new node that was coming up to replace the old worker node was removed. This leaves the cluster in a state where all of the nodes in the node pool are deleting and all of the pods go into a Pending state, because there is no healthy node for them to move to.
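To confirm that no schedulable worker node is left, the node and MachineDeployment status can be checked, for example:
kubectl get nodes
kubectl get machinedeployments -A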
To get out of this situation, we can temporarily move the important controller pods onto the control plane so that a new worker node can be created.
THIS MUST BE REVERTED AND SHOULD NOT BE LEFT IN THIS STATE
***This should be done only to recover the cluster if needed***
Which pods need to be edited depends on the environment. In our example scenario, this was a Cloud Director cluster.
kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager
Add the below under spec.template.spec:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
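If editing interactively is not convenient, roughly the same change can be applied with a patch. This is only a sketch: a merge patch replaces the deployment's existing tolerations list, so check the current spec first. The same approach can be used for the rdeprojector-controller-manager deployment edited below.
# Replaces the tolerations list on the capvcd controller deployment
kubectl patch deployment capvcd-controller-manager -n capvcd-system --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[
        {"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},
        {"key":"node-role.kubernetes.io/control-plane","effect":"NoSchedule"}]}}}}'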
kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
This will schedule the Pods on the control plane so that the cluster can recover from this state.
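Once the controller pods are Running on the control plane, new worker machines should start provisioning; this can be watched with, for example:
kubectl get machines -A -w
kubectl get nodes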
Once the cluster has recovered, revert these changes so that the pods are scheduled back on the node pool where they belong.
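To revert, edit the same deployments again, remove the tolerations that were added, and confirm the pods move back onto the node pool, for example:
# Re-edit both deployments and delete the tolerations added above
kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager
kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager

# Confirm the pods are scheduled back onto the node pool workers
kubectl get pods -n capvcd-system -o wide
kubectl get pods -n rdeprojector-system -o wide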