Worker nodes stuck in ready,schedulingdisabled during tkg upgrade
search cancel

Worker nodes stuck in ready,schedulingdisabled during tkg upgrade

book

Article ID: 440729

calendar_today

Updated On:

Products

VMware Telco Cloud Automation VMware Telco Cloud Platform

Issue/Introduction

During a TKG workload cluster upgrade or orchestration operation, one or more worker nodes remain stuck indefinitely in a Ready,SchedulingDisabled (Cordoned) state.

Attempts to manually uncordon the node may fail or revert immediately to a cordoned state

Reviewing the API server or operator logs show errors similar to: 

2026-05-18T09:22:30.989Z [Err-profileMonitor] : TriggerProfileUpdate wc-np1-#### failed with error: Internal error occurred: failed calling webhook "validator.nodeconfig.acm.vmware.com": failed to call webhook: Post "https://nodeconfigvalidator.tca-system.svc:443/validate-nodeconfig?timeout=5s": context deadline exceeded, will retry after 1 second

 

Environment

TCP 5.0

TCA 3.2

Cause

The issue is caused by a timeout mismatch within the nodeconfig-operator framework:

  1. Function Stall: The nodeconfig-daemon pod on the new node calls the TriggerProfileUpdate function to create a dummy NodeConfig Custom Resource (CR). This CR is required for the operator to start generating the configuration profile for the node.

  2. Internal Timeout: While the Kubernetes API server uses a 5-second timeout for the admission webhook, the nodeconfig-daemon has an internal timeout of 1 second for this specific function.

  3. Execution Loop: If the nodeconfigvalidator webhook takes longer than 1 second to respond, the internal timer expires first, causing the context deadline exceeded error. This forces the pod into a retry loop, preventing it from advancing to the next phases where it would update the fields necessary for the node to be uncordoned.

Resolution

To recover the node, restart the nodeconfig-daemon pod on the affected worker node.

  1. Log in to the affected workload cluster context as the capv user.
  2. Identify the nodeconfig-daemon pods on the affected node:
    kubectl get pods -n tca-system | grep nodeconfig-daemon
  3. Restart the nodeconfig-daemon pods by deleting them :
    kubectl delete pod <node-config-daemon-pod-name> -n tca-system
  4. Monitor the reconciliation: Allow the newly initialized manager instance to complete the verification handshake with the API server.

  5. Verify Node Status: Confirm the node cleanly transitions to Ready status and automatically uncordons:
    kubectl get nodes