Tanzu Kubernetes Cluster Upgrade Stalls During Rollout Deletion Due to TMC Agent Pods

Article ID: 430591

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Tanzu Kubernetes Cluster (TKC) upgrades stall during the final MachineDeployment rollout deletion when the last worker node fails to drain due to non-terminating pods. The upgrade controller reports a MachineDeploymentsUpgradePending status, with the node remaining in the DrainingNode stage for an extended period.
  • The following error messages and logs are observed:
    • Machine deletion in progress since more than 15m, stage: DrainingNode

    • Cluster API (CAPI) logs report: Drain not completed yet... Pods not terminating

    • Eviction triggers indicate: PodsToTriggerEvictionNow: vmware-system-tmc/cluster-auth-pinniped-kube-cert-agent-*

  • The drain process is actively blocked by the cluster-auth-pinniped-kube-cert-agent pod within the vmware-system-tmc namespace.
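To confirm which pods are holding up the drain, one option is to list the pods still scheduled on the draining node and filter for those stuck in Terminating. The following is a minimal sketch rather than part of the documented procedure: `terminating_pods` is a hypothetical helper name, `<node-name>` is a placeholder, and the filter assumes the default `kubectl get pods -A` column layout in which STATUS is the fourth column.

```shell
# Hypothetical helper: reads `kubectl get pods -A` output on stdin and prints
# namespace/name for every pod whose STATUS column is "Terminating".
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
terminating_pods() {
  awk 'NR > 1 && $4 == "Terminating" { print $1 "/" $2 }'
}

# Usage (assumed to run against the workload cluster; <node-name> is a placeholder):
#   kubectl get pods -A --field-selector spec.nodeName=<node-name> | terminating_pods
```

If the cluster-auth-pinniped-kube-cert-agent pod in the vmware-system-tmc namespace appears in the output, the symptom matches this article.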

Environment

VMware vSphere Kubernetes Service

Cause

Tanzu Mission Control (TMC) agent pod finalizers prevent pod termination during the automated node drain sequence. When the machine controller drains the node, the TMC pod finalizer blocks eviction until the pod eviction timeout is reached. The node drain then stalls, the machine deletion times out, and the TKC rollout remains blocked until the pod is removed.
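One way to check whether a finalizer is holding the pod is to print its deletion timestamp and finalizer list. This is a hedged sketch: `show_pod_finalizers` is a hypothetical helper name, the pod name is supplied by the caller, and the command is assumed to run against the workload cluster that hosts the vmware-system-tmc namespace.

```shell
# Hypothetical helper: prints name, deletionTimestamp, and finalizers for one
# pod in the vmware-system-tmc namespace. A set deletionTimestamp together
# with a non-empty finalizers list suggests the pod is waiting on a finalizer.
show_pod_finalizers() {
  kubectl get pod "$1" -n vmware-system-tmc \
    -o jsonpath='{.metadata.name}{"\t"}{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
}

# Usage (<pod-name> is a placeholder):
#   show_pod_finalizers <pod-name>
```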

Resolution

To resolve this issue, scale down the blocking deployment to allow the machine drain to complete.

  1. Scale the identified blocking deployment down to 0 replicas to bypass the finalizer lock:

    kubectl scale deployment cluster-auth-pinniped-kube-cert-agent --replicas=0 -n vmware-system-tmc

  2. Monitor the machine deletion process to confirm it proceeds successfully:

    kubectl get machine <machine-name> -n <namespace> -w

  3. Verify that the machine status successfully transitions to Deleted.

  4. Monitor the TKC object to verify the rollout automatically completes once the node is removed:

    kubectl get tanzukubernetescluster <cluster-name> -n <namespace> -w
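After scaling the deployment down in the first step, it may help to confirm that no agent pods remain before watching the machine deletion. The sketch below is an illustration under assumptions, not part of the documented procedure: `agent_pods_remaining` is a hypothetical helper name, and it assumes the pod names begin with the deployment name shown in the eviction message above.

```shell
# Hypothetical helper: counts pods in vmware-system-tmc whose names start with
# the blocking deployment's name. Prints 0 once the scale-down has taken effect.
agent_pods_remaining() {
  kubectl get pods -n vmware-system-tmc --no-headers 2>/dev/null \
    | grep -c '^cluster-auth-pinniped-kube-cert-agent-' || true
}

# Usage (assumed to run against the workload cluster):
#   agent_pods_remaining
```

A result of 0 indicates the pod is gone and the drain should be able to proceed. A non-zero result that persists suggests the pod is still terminating.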