Tanzu Hub Upgrade or Configuration Change Hangs During BOSH Deployment

Article ID: 416004


Products

VMware Tanzu Platform
VMware Tanzu Platform - Hub

Issue/Introduction

When upgrading Tanzu Hub or applying a configuration change, the BOSH task appears to hang for an extended period.

To confirm, get a list of all running BOSH tasks and identify the one in a processing state.

    bosh tasks


Example Output:

    Using environment '10.###.###.###' as client 'ops_manager'
    ID   State       Started At                    Finished At  User         Deployment                Description        Result
    215  processing  Tue Sep 30 08:40:43 UTC 2025  -            ops_manager  hub-####################  create deployment    


Inspect the events for the processing task to identify the stuck instance group. The output will show the instance group stuck in the pre-stop phase.

    # Check all events for the stuck task ID
    bosh task 215 --event | grep -n -C 3 '"stage":"Updating instance"'
    
    # Or filter by a specific instance group, for example 'control'
    bosh task 215 --event | grep -n -C 3 '"stage":"Updating instance"' | grep control


Example Output:

   151:{"time":1759222920,"stage":"Updating instance","tags":["control"],"total":3,"task":"control/########-####-####-####-############ (1)","index":2,"state":"started","progress":0}
    152:{"time":1759222921,"stage":"Updating instance","tags":["control"],"total":3,"task":"control/########-####-####-####-############ (1)","index":2,"state":"in_progress","progress":5,"data":{"status":"executing pre-stop"}}


SSH into the stuck instance VM to investigate further.

    bosh ssh control/########-####-####-####-############ -d hub-####################


Check the drain.stderr.log, which will show that a pod cannot be evicted due to its PodDisruptionBudget (PDB).

    less /var/vcap/sys/log/kubelet/drain.stderr.log


Example Log Output:

    error when evicting pods/"contour-envoy-############-#####" -n "tanzusm" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
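
To identify the PodDisruptionBudget that covers the blocked pod, list the PDBs in the namespace from the log message. This is a minimal check, assuming kubectl access to the cluster (for example, from the registry VM used in the workaround below):

    # An ALLOWED DISRUPTIONS value of 0 indicates a budget that will block eviction
    kubectl get pdb -n tanzusm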

Cause

During an upgrade or the deployment of a configuration change, BOSH attempts to safely drain Kubernetes nodes before updating them. This process can get stuck if a pod on the node cannot be evicted. The eviction fails because of a PodDisruptionBudget (PDB), which enforces a minimum availability for a set of pods. This situation arises under two primary conditions:

  • Node Unavailability for Rescheduling: A pod cannot be rescheduled because no other suitable nodes are available.

    This can happen if an entire instance group (like clickhouse-logs) is being removed during an upgrade, especially if it had been scaled to more than one instance. The PDB prevents pod eviction because there are no available nodes with the required taints/tolerations to move the pod to.

  • Resource Constraints on Available Nodes: Other nodes are available, but they lack sufficient resources (for example, CPU or memory) for the Kubernetes scheduler to place the pod.

    This is seen with the contour-envoy pods on control nodes: when the remaining nodes are under heavy load, the evicted pod cannot be rescheduled, so the PDB for contour-envoy blocks the drain process and stalls the upgrade. The checks after this list can help confirm which condition applies.
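
Both checks below are a sketch, assuming kubectl access to the cluster (for example, from the registry VM used in the workaround) and, for kubectl top, a running metrics-server:

    # Show per-node CPU and memory usage to spot resource pressure (requires metrics-server)
    kubectl top nodes

    # Look for FailedScheduling events explaining why the evicted pod cannot be placed
    kubectl get events -n tanzusm --field-selector reason=FailedScheduling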

Resolution

Workaround

To allow the BOSH drain process to complete, you can temporarily delete the PodDisruptionBudgets from the tanzusm namespace. The PDBs are automatically re-created after the installation completes.

  1. List all BOSH instances for the deployment to identify the registry VM.

        bosh instances -d hub-####################

  2. SSH into the registry VM.

        bosh ssh registry/########-####-####-####-############ -d hub-####################

  3. If a backup of the PDBs does not already exist, create one (saved here as poddisruptionbudgets.yaml):

        kubectl get pdb -n tanzusm -o yaml > poddisruptionbudgets.yaml

  4. From the registry VM, delete all PDBs in the tanzusm namespace.

        kubectl delete pdb --all -n tanzusm

After the PDBs are deleted, the BOSH task should proceed and complete successfully.
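
To verify the workaround, confirm that the previously stuck task resumes and, once the deployment finishes, that the PDBs have been re-created. A minimal sketch, reusing the task ID and the backup file from the steps above:

    # Follow the previously stuck task (replace 215 with your task ID)
    bosh task 215

    # After the deployment completes, confirm the PDBs were re-created
    kubectl get pdb -n tanzusm

    # If any PDB is missing, restore it from the backup taken in step 3
    kubectl apply -f poddisruptionbudgets.yaml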

Solution

This issue is resolved in Tanzu Hub 10.3. Upgrading to version 10.3 or later prevents the problem from occurring during subsequent configuration changes and upgrades.