When a worker node is rebooted, it will come back online in a 'Ready,SchedulingDisabled' state:
NAME                                                       STATUS                     ROLES           AGE    VERSION
mit-shared-prod-test-1-cdm64-xmlsf                         Ready                      control-plane   2d4h   v1.27.5+vmware.1
mit-shared-prod-test-1-lin-0-v5g7p-cfbbbc7f7xf55qp-wt58q   Ready,SchedulingDisabled   <none>          29h    v1.27.5+vmware.1
You will also notice that when checking the machine's status from the management cluster context, it appears in a deleting state.
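To see this, switch to the management cluster context and list the Cluster API machine objects; the context name below is a placeholder for your environment:
kubectl config use-context <management-cluster-context>
kubectl get machines -A
The affected machine will report a PHASE of 'Deleting' in this output.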
When a MachineHealthCheck (MHC) is configured on a workload cluster, the default timeout before a machine is marked as unhealthy and scheduled for deletion is 5 minutes. This issue arises when, from the management cluster's perspective, the node takes longer than 5 minutes to come back online. There are two possible reasons for this:
- The node may simply take longer than 5 minutes to reboot and rejoin the cluster while the MHC is still using its default unhealthy conditions, which correspond to:
tanzu cluster machinehealthcheck node set test-cluster --unhealthy-conditions "Ready:False:5m,Ready:Unknown:5m"
- Another possible cause for this issue is a time discrepancy between the workload cluster and the management cluster. Since it typically takes around two to three minutes for a node to transition from "NotReady" to "Ready" after a reboot, a time difference of more than three minutes can lead to the node being flagged for deletion if the MHC is using the default 5-minute thresholds.
To address this, ensure proper time synchronisation by configuring NTP servers on both the management and workload clusters, as outlined in this KB article:
https://knowledge.broadcom.com/external/article?articleNumber=337407
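If you want to confirm that the clocks actually differ, one simple check (assuming you have SSH access to a node in each cluster, for example as the capv user on vSphere-based deployments; the IP addresses below are placeholders) is to compare UTC timestamps:
ssh capv@<management-cluster-node-ip> date -u
ssh capv@<workload-cluster-node-ip> date -u
A difference of more than a couple of minutes between the two outputs indicates the time discrepancy described above.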
As a temporary workaround, you can adjust the MachineHealthCheck (MHC) configuration to be more lenient. For example, you can configure the MHC to delete a node only after it has been in a "NotReady" state for 10 minutes or more. This adjustment should provide enough time for the node to come back online, even when there is a significant time discrepancy:
tanzu cluster machinehealthcheck node set test-cluster --unhealthy-conditions "Ready:False:10m,Ready:Unknown:10m"
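To confirm that the new thresholds are in effect, you can inspect the MachineHealthCheck object from the management cluster context; the object and namespace names below are placeholders:
kubectl get machinehealthcheck -A
kubectl get machinehealthcheck <mhc-name> -n <workload-cluster-namespace> -o yaml
The spec.unhealthyConditions section of the output should reflect the 10-minute timeouts.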
The "SchedulingDisabled" component of this issue arises when a node is marked for deletion but cannot be properly drained. This is typically caused by a PodDisruptionBudget (PDB) that restricts the draining process. This situation is more commonly encountered in clusters where a single control plane and worker node are connected to TMC, often due to the "gatekeeper-controller" PodDisruptionBudget. To resolve this and allow the node to complete its deletion, there are two potential approaches:
- Force delete the pods protected by the PodDisruptionBudget (in this case, the pods in the gatekeeper-system namespace) so that the drain can complete:
kubectl delete pod -n gatekeeper-system --all --force
- Scale out the worker nodes so that the PodDisruptionBudget can still be satisfied while the affected node is drained:
tanzu cluster scale test-cluster --controlplane-machine-count 1 --worker-machine-count 3
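If you are unsure which PodDisruptionBudget is blocking the drain, you can list the budgets on the workload cluster; an entry with an ALLOWED DISRUPTIONS value of 0 is usually the one preventing the node from being drained:
kubectl get pdb -A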
Note: It is not advisable to connect a cluster with a single control plane node and a single worker node to TMC. If an upgrade or machine recreation occurs for reasons other than those outlined in this article, the worker node may again get stuck during the draining process.