TKG cluster Pods are not scheduling

Article ID: 415385


Products

VMware Tanzu Kubernetes Grid Management
VMware Cloud Director

Issue/Introduction





In this scenario, a guest cluster is stuck with scheduling disabled and its machines are in a Deleting state. A change was made to the cluster that left all of the node pool's machines deleting.


The control plane is healthy and working.

Example below: 

vcdmachinetemplates.infrastructure.cluster.x-k8s.io   2024-02-14T12:09:21Z

NAME            AGE
<NAMESPACE>     456d
<NAMESPACE>     456d
<NAMESPACE>     615d

NAMESPACE     NAME                                                  PHASE      AGE    VERSION
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    Running    615d   v1.24.11+vmware.1

CLUSTER     NODENAME      PROVIDERID
<CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>
<CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>
<CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>
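The listings above can be gathered from the cluster that holds the Cluster API objects with commands along these lines (a sketch; only standard kubectl calls are used and the resource names follow the output above):

kubectl get vcdmachinetemplates.infrastructure.cluster.x-k8s.io -A
kubectl get machines.cluster.x-k8s.io -A -o wide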



When looking at the pods in the cluster, they are all in a Pending state:

argocd                              dex-server-6cd8cfc9d4-v5j                        0/1   Pending   0   111m   <none>       <none>
argocd                              notifications-controller-6d8fc9d4-v9k            0/1   Pending   0   111m   <none>       <none>
argocd                              redis-7dc8b4fbd9-kgqw7                           0/1   Pending   0   111m   <none>       <none>
argocd                              repo-server-848b86fcf6-6l2                       0/1   Pending   0   111m   <none>       <none>
argocd                              server-7b895dc7d9-zfhf7                          0/1   Pending   0   111m   <none>       <none>
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-0bb    1/1   Running   0   27h    <NODEPOOL>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager    1/1   Running   0   27h    <NODEPOOL>
capi-system                         capi-controller-manager-6                        1/1   Running   0   27h    <NODEPOOL>
capvcd-system                       capvcd-controller-manager                        0/1   Pending   0   111m   <none>       <none>
cert-manager                        cert-manager-6d74f84b64-f                        0/1   Pending   0   111m   <none>       <none>
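The Pending pods, and the scheduler's reason for not placing them, can be checked with something like the following; the redis pod name is just an example taken from the output above:

kubectl get pods -A -o wide
kubectl describe pod -n argocd redis-7dc8b4fbd9-kgqw7

The Events section of the describe typically shows why the pod cannot be scheduled, for example untolerated taints or no available nodes.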



A describe of the machines also shows that they are repeatedly attempting to drain:

Type     Reason                    Age                    From                           Message
----     ------                    ----                   ----                           -------
Normal   SuccessfulDrainNode       3m7s (x1014 over 26h)  machine-controller             success draining Machine's node "<NODE>..."
Normal   MachineMarkedUnhealthy    2m40s (x5621 over 25h) machinehealthcheck-controller  Machine <NAMESPACE>/<MACHINE_HEALTH_CHECK>/<MACHINE>
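These events are recorded on the Machine objects and can be reviewed with, for example:

kubectl describe machine.cluster.x-k8s.io -n <NAMESPACE> <CLUSTER>-<NODEPOOL>-<ID>
kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp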

 

Environment

TKGM 2.x

Cloud Director 

Cause

This is caused when, during changes to the environment, the new node that was coming up to replace the old workload node is removed. This leaves the cluster in a state where all of the nodes in the node pool are deleting, and all of the pods go into a Pending state because there is no healthy node for them to move to.
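A quick way to confirm this state is to list the nodes in the guest cluster; workers that are being drained are cordoned, so they report SchedulingDisabled while no replacement worker is present:

kubectl get nodes -o wide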

Resolution

To get out of this situation, we can temporarily move the important pods to the control plane so that a new worker node can be created.

THIS MUST BE REVERTED AND SHOULD NOT BE LEFT IN THIS STATE
***This should be done only to recover the cluster if needed***

Which pods need to be edited depends on the environment. In this example scenario, the cluster is running on Cloud Director.

 

kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager


Add the below: 

      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
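For reference, the tolerations belong in the pod template of the Deployment (spec.template.spec), not at the top level of the spec. An abridged sketch of where the block sits in the manifest opened by kubectl edit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: capvcd-controller-manager
  namespace: capvcd-system
spec:
  template:
    spec:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      # ...containers and the rest of the pod spec remain unchanged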

 

kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager

      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
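If a non-interactive change is preferred, the same tolerations can also be applied with a merge patch instead of kubectl edit; a sketch for the rdeprojector deployment (note that a merge patch replaces the whole tolerations list, so include any tolerations the Deployment already has):

kubectl patch deployment rdeprojector-controller-manager -n rdeprojector-system --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"CriticalAddonsOnly","operator":"Exists"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane"}]}}}}'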



This will schedule the pods on the control plane so that the cluster can recover from this state.
Once the cluster has recovered, revert these changes so that the pods are scheduled back on the node pool where they belong.
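Once the controllers are running on the control plane, a replacement worker machine should be created and the node pool should recover. Progress can be watched, and the tolerations removed again afterwards, with commands such as:

kubectl get machines.cluster.x-k8s.io -A
kubectl get nodes

kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager
kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager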