Updating TKGS from 3.3.2-embedded to 3.3.3-embedded fails | ProgressDeadlineExceeded message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager" has timed out progressing


Article ID: 409822


Products

Tanzu Kubernetes Runtime

Issue/Introduction

Updating TKGS from 3.3.2-embedded to 3.3.3-embedded hangs with the following error:

Configured Core Supervisor Services
Service: tkg.vsphere.vmware.com. Reason: ReconcileFailed. Message: kapp: Error: waiting on reconcile packageinstall/tanzu-cluster-api-bootstrap-kubeadm (packaging.carvel.dev/v1alpha1) namespace: svc-tkg-domain-####: Finished unsuccessfully (Reconcile failed: (message: kapp: Error: waiting on reconcile deployment/capi-kubeadm-bootstrap-controller-manager (apps/v1) namespace: svc-tkg-domain-#####: Finished unsuccessfully (Deployment is not progressing: ProgressDeadlineExceeded (message: ReplicaSet "capi-kubeadm-bootstrap-controller-manager-##########" has timed out progressing.)))).
Service: velero.vsphere.vmware.com. Status: Running
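The failing PackageInstall and deployment named in the error can be inspected directly from the Supervisor cluster context. The commands below are a minimal sketch; the namespace placeholder corresponds to the svc-tkg-domain-#### namespace in the message above and varies per environment.

kubectl get packageinstall tanzu-cluster-api-bootstrap-kubeadm -n svc-tkg-domain-####
kubectl describe deployment capi-kubeadm-bootstrap-controller-manager -n svc-tkg-domain-####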

The failure is accompanied by a pod stuck in the Pending state, with scheduling events referencing unavailable host ports:

Warning FailedScheduling 2m13s (x37452 over 26d) default-scheduler 0/7 nodes are available: 3 node(s) didn't have free ports for the requested pod ports, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/7 nodes are available: 3 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.
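To locate the stuck pod behind these events, Pending pods can be listed across all namespaces and then described (a generic kubectl check; replace the placeholders with the values returned by the first command):

kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n <namespace>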
 
Running the following command shows one more request for hostPort 8085 than for the other checked ports:

# kubectl get po -o yaml -A | grep -i hostport | sort | uniq -c | grep -E '9875|9441|8085'
      4         hostPort: 8085
      3         hostPort: 9441
      3         hostPort: 9875
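To map those hostPort requests back to specific pods, a custom-columns query along these lines can be used (assuming the hostPorts are declared on the container specs, as in the grep above; pod names will vary by environment):

kubectl get pods -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTPORTS:.spec.containers[*].ports[*].hostPort' | grep 8085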

Cause

Both the capi-kubeadm-bootstrap-controller-manager deployment (part of the tkg.vsphere.vmware.com Supervisor Service) and the velero.vsphere.vmware.com Supervisor Service bind to hostPort 8085. Because a hostPort is a node-level resource, only one pod per node can claim a given hostPort.

Although the capi-kubeadm deployment is configured for two replicas, a third pod was observed in a Pending state during the rollout, likely a temporary additional pod created by a transient update or restart. Velero was already occupying hostPort 8085 on one node, and the two running capi-kubeadm pods held it on the other two eligible nodes, leaving no node on which the third pod could schedule. The resulting scheduling deadlock blocked the deployment and stalled the upgrade.
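The conflict can be confirmed by checking which nodes the hostPort 8085 consumers are scheduled on; the deployment names below are taken from the error output and the Resolution section, and the pod and node names in the output will vary by environment:

kubectl get pods -A -o wide | grep -E 'capi-kubeadm-bootstrap-controller-manager|backup-driver'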

Resolution

Temporarily scale down the Velero Supervisor Service to release hostPort 8085.

kubectl scale deploy/backup-driver -n velero --replicas=0
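With Velero scaled down, the blocked deployment's recovery can be watched with a standard rollout check; replace the namespace with the svc-tkg-domain-#### namespace from the error message:

kubectl rollout status deployment/capi-kubeadm-bootstrap-controller-manager -n svc-tkg-domain-####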

Once the port is freed, the stuck capi-kubeadm pod will schedule and the upgrade will proceed. After the rollout completes and the deployment stabilizes at two replicas, Velero can be safely scaled back up if needed. It will land on the third node, where hostPort 8085 is no longer in use, avoiding further conflict.

kubectl scale deploy/backup-driver -n velero --replicas=1
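As a final check, confirm that the backup-driver pod returns to Running and lands on a node without an existing hostPort 8085 binding (pod placement will vary by environment):

kubectl get pods -n velero -o wide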