An Apply Changes run or a manual bosh deploy can stall for hours while updating a service instance deployment. Running bosh tasks shows the deployment stuck in the "processing" state:
Using environment '##.#.##.#' as client 'ops_manager'
ID State Started At Finished At User Deployment Description Result
3705 processing Wed Apr 30 00:20:09 UTC 2025 - pivotal-container-service-##### service-instance_########## create deployment
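To see exactly where the update is stuck, one option is to follow the output of the processing task (3705 is the task ID from the sample output above; substitute your own):

bosh task 3705
bosh task 3705 --debug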
When checking the health of the workloads, you may find that pods are down and that all of the failing pods are concentrated on the same worker node, which indicates a problem with that node. You can confirm this by listing the pods scheduled on it:
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
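If you do not yet know which node is affected, one way to spot the pattern is to list all non-running pods along with the node they are scheduled on:

kubectl get pods --all-namespaces --field-selector status.phase!=Running -o wide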
TKGi v1.21.0
The possible causes of a problematic node vary widely. To gather more information, SSH into the node and check its resource usage (disk space, memory, etc.) and its logs.
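Because TKGi worker nodes are BOSH-managed VMs, one way to reach the node is through bosh ssh. In the sketch below, the deployment name is the masked example from the task output above, worker/<instance-id> is a placeholder, and the exact log paths can vary by TKGi version:

bosh -d service-instance_########## vms --vitals
bosh -d service-instance_########## ssh worker/<instance-id>
df -h
free -m
sudo tail -n 200 /var/vcap/sys/log/kubelet/kubelet.stderr.log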
A quick emergency solution is to cordon and drain the problematic worker node. These two Kubernetes commands safely prepare a node for maintenance by removing its workloads; in this case, the goal is to move the pods onto a healthy node. Cordon marks the node as unschedulable, and drain evicts all pods from it. Make sure the remaining nodes have enough capacity to absorb the evicted workloads.
kubectl cordon $node
kubectl drain $node --ignore-daemonsets
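After the drain completes, you can confirm that the node is marked SchedulingDisabled and that the evicted pods were rescheduled onto other workers:

kubectl get nodes
kubectl get pods --all-namespaces -o wide | grep <node-name>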
Afterwards, BOSH should take over and clear the remaining tasks, and the node should come back healthy after running another Apply Changes.
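To confirm that the stalled task finishes and the node recovers, you can watch the recent BOSH tasks and the instance processes (the deployment name is the masked example from above), then verify the node reports Ready in Kubernetes:

bosh tasks --recent
bosh -d service-instance_########## instances --ps
kubectl get nodes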
Before upgrading or making any changes, it is best practice to verify that the platform and foundation are healthy. A few checks you can make are the following: