In this scenario, a guest cluster is stuck with its nodes in a SchedulingDisabled state and its machines in a Deleting state. A change made to the cluster has left all of the node pool's machines deleting.
The control plane is healthy and working
Example below:
vcdmachinetemplates.infrastructure.cluster.x-k8s.io   2024-02-14T12:09:21Z

NAME          AGE
<NAMESPACE>   456d
<NAMESPACE>   456d
<NAMESPACE>   615d
NAMESPACE     NAME                                                  CLUSTER     NODENAME      PROVIDERID          PHASE      AGE    VERSION
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Deleting   456d   v1.24.11+vmware.1
<NAMESPACE>   machine.cluster.x-k8s.io/<CLUSTER>-<NODEPOOL>-<ID>    <CLUSTER>   <NODE_NAME>   vmware-cloud-<ID>   Running    615d   v1.24.11+vmware.1
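For reference, output similar to the above can be gathered with commands such as the following, run against the context that hosts the Cluster API objects (the namespace is a placeholder):
# List the CAPVCD machine templates and the Cluster API machines
kubectl get vcdmachinetemplates -n <NAMESPACE>
kubectl get machines -A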
When looking at the pods in the cluster, they are stuck in a Pending state:
argocd                              dex-server-6cd8cfc9d4-v5j                       0/1   Pending   0   111m   <none>   <none>
argocd                              notifications-controller-6d8fc9d4-v9k           0/1   Pending   0   111m   <none>   <none>
argocd                              redis-7dc8b4fbd9-kgqw7                          0/1   Pending   0   111m   <none>   <none>
argocd                              repo-server-848b86fcf6-6l2                      0/1   Pending   0   111m   <none>   <none>
argocd                              server-7b895dc7d9-zfhf7                         0/1   Pending   0   111m   <none>   <none>
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-0bb   1/1   Running   0   27h    <NODEPOOL>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager   1/1   Running   0   27h    <NODEPOOL>
capi-system                         capi-controller-manager-6                       1/1   Running   0   27h    <NODEPOOL>
capvcd-system                       capvcd-controller-manager                       0/1   Pending   0   111m   <none>   <none>
cert-manager                        cert-manager-6d74f84b64-f                       0/1   Pending   0   111m   <none>   <none>
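A quick way to list only the stuck pods is to filter on the pod phase, for example:
kubectl get pods -A -o wide --field-selector=status.phase=Pending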
When describing the machines, we also see that they are repeatedly attempting to drain:
Type     Reason                   Age                      From                            Message
----     ------                   ----                     ----                            -------
Normal   SuccessfulDrainNode      3m7s (x1014 over 26h)    machine-controller              success draining Machine's node "<NODE>..."
Normal   MachineMarkedUnhealthy   2m40s (x5621 over 25h)   machinehealthcheck-controller   Machine <NAMESPACE>/<MACHINE_HEALTH_CHECK>/<MACHINE>
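These events come from describing the affected Machine objects, for example (names are placeholders):
kubectl describe machine -n <NAMESPACE> <CLUSTER>-<NODEPOOL>-<ID>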
TKGM 2.x
Cloud Director
This is caused by a change to the environment in which the new node that was coming up to replace the old worker node was removed. This leaves the cluster in a state where all of the nodes in the node pool are deleting and all of the pods go into a Pending state, because there is no healthy node for them to move to.
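To confirm that no schedulable worker node is left, the node and MachineDeployment status can be checked, for example:
kubectl get nodes
kubectl get machinedeployments -A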
To get out of this situation, we can temporarily move the important controller pods onto the control plane so that a new worker node can be created.
THIS MUST BE REVERTED AND SHOULD NOT BE LEFT IN THIS STATE
***This should be done only to recover the cluster if needed***
Which pods need to be edited depends on the environment. In our example scenario, this was a Cloud Director cluster.
kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager
Add the below under spec.template.spec:
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
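If editing interactively is not convenient, roughly the same change can be applied with a patch. This is only a sketch: a merge patch replaces the deployment's existing tolerations list, so check the current spec first. The same approach can be used for the rdeprojector-controller-manager deployment edited below.
# Replaces the tolerations list on the capvcd controller deployment
kubectl patch deployment capvcd-controller-manager -n capvcd-system --type merge \
  -p '{"spec":{"template":{"spec":{"tolerations":[
        {"key":"node-role.kubernetes.io/master","effect":"NoSchedule"},
        {"key":"node-role.kubernetes.io/control-plane","effect":"NoSchedule"}]}}}}'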
kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
This will schedule the Pods on the control plane so that the cluster can recover from this state.
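Once the controller pods are Running on the control plane, new worker machines should start provisioning; this can be watched with, for example:
kubectl get machines -A -w
kubectl get nodes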
Once the cluster has recovered, revert these changes so that the pods are scheduled back on the node pool where they belong.
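To revert, edit the same deployments again, remove the tolerations that were added, and confirm the pods move back onto the node pool, for example:
# Re-edit both deployments and delete the tolerations added above
kubectl edit deployments.apps -n capvcd-system capvcd-controller-manager
kubectl edit deployments.apps -n rdeprojector-system rdeprojector-controller-manager

# Confirm the pods are scheduled back onto the node pool workers
kubectl get pods -n capvcd-system -o wide
kubectl get pods -n rdeprojector-system -o wide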