During a TKG workload cluster upgrade or orchestration operation, one or more worker nodes remain stuck indefinitely in a Ready,SchedulingDisabled (Cordoned) state.
Attempts to manually uncordon the node may fail or revert immediately to a cordoned state
Reviewing the API server or operator logs show errors similar to:
2026-05-18T09:22:30.989Z [Err-profileMonitor] : TriggerProfileUpdate wc-np1-#### failed with error: Internal error occurred: failed calling webhook "validator.nodeconfig.acm.vmware.com": failed to call webhook: Post "https://nodeconfigvalidator.tca-system.svc:443/validate-nodeconfig?timeout=5s": context deadline exceeded, will retry after 1 second
TCP 5.0
TCA 3.2
The issue is caused by a timeout mismatch within the nodeconfig-operator framework:
nodeconfig-daemon pod on the new node calls the TriggerProfileUpdate function to create a dummy NodeConfig Custom Resource (CR). This CR is required for the operator to start generating the configuration profile for the node.nodeconfig-daemon has an internal timeout of 1 second for this specific function.nodeconfigvalidator webhook takes longer than 1 second to respond, the internal timer expires first, causing the context deadline exceeded error. This forces the pod into a retry loop, preventing it from advancing to the next phases where it would update the fields necessary for the node to be uncordoned.To recover the node, restart the nodeconfig-daemon pod on the affected worker node.
nodeconfig-daemon pods on the affected node:kubectl get pods -n tca-system | grep nodeconfig-daemonnodeconfig-daemon pods by deleting them :kubectl delete pod <node-config-daemon-pod-name> -n tca-systemkubectl get nodes