Problematic Node Causes Apply Changes/BOSH Deployment to Hang When Updating a Service Instance Deployment

Article ID: 396123

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

An Apply Changes run or a manual bosh deploy can stall for hours while updating a service instance deployment. Running bosh tasks shows the deployment task stuck in the "processing" state:

Using environment '##.#.##.#' as client 'ops_manager'

ID    State       Started At                    Finished At  User                             Deployment                   Description        Result
3705  processing  Wed Apr 30 00:20:09 UTC 2025  -            pivotal-container-service-#####  service-instance_##########  create deployment

When checking the health of your workloads, you may see Pods that are down, with all of the failing Pods concentrated on the same worker node. This indicates there may be an issue with that node. You can list the Pods scheduled on the suspect node with:

kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
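
If you have not yet identified which worker node the failures are concentrated on, one quick way to list Pods that are not Running together with the nodes they are scheduled on is the command below (a general kubectl example, not specific to this article; note that it also lists completed Pods in the Succeeded phase):

kubectl get pods --all-namespaces -o wide --field-selector status.phase!=Running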

Environment

TKGi v1.21.0

Cause

The root cause of a problematic node can vary widely. To gather more information, SSH into the node and check its resource usage (disk space, memory, etc.) and its logs.
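
For example, you can SSH into the worker through BOSH and run a few general-purpose checks. The deployment and instance names below are placeholders, and the kubelet log path shown is the usual BOSH job log location on TKGI workers, which may vary by version:

bosh -d service-instance_<deployment-id> ssh worker/<instance-id>
df -h                                                           # disk space
free -m                                                         # memory pressure
sudo dmesg | tail -50                                           # OOM kills, disk or I/O errors
sudo tail -100 /var/vcap/sys/log/kubelet/kubelet.stderr.log     # kubelet logs (path may vary)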

Resolution

A quick emergency mitigation is to cordon and drain the problematic worker node. These two Kubernetes commands safely prepare a node for maintenance by removing its pods/workloads; in this case, the goal is to move the Pods to a non-problematic node. Cordon marks the node as unschedulable, and drain evicts all Pods from it. Make sure you have enough spare node capacity to handle the evicted workloads being rescheduled.

kubectl cordon $node
kubectl drain $node --ignore-daemonsets
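
If the drain is blocked, kubectl reports which Pods are preventing it. Two commonly needed additions (standard kubectl drain flags, included here as an assumption about your workloads rather than a required step) are --delete-emptydir-data for Pods using emptyDir volumes and --force for Pods not managed by a controller, for example:

kubectl drain $node --ignore-daemonsets --delete-emptydir-data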


Afterwards, BOSH should resume and clear the remaining tasks, and the node should come back healthy after running another Apply Changes.
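
Keep in mind that a cordoned node stays marked as SchedulingDisabled until it is either recreated or explicitly uncordoned. If the same node returns healthy and you want workloads scheduled on it again, uncordon it:

kubectl uncordon $node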

Before upgrading or making any changes, it is always best practice to make sure the platform and foundation are healthy. A few checks you can make are listed below, with example commands following the list:

  • Run tkgi clusters and check the status of all clusters (the status should show succeeded)
  • BOSH health: make sure all VMs are in a running state
  • BOSH tasks: ensure there are no long-running or stuck tasks
  • Node health: make sure all nodes are in a Ready state
  • Pod health: make sure no Pods are in Error, Pending, or CrashLoopBackOff
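
The checks above map to the following example commands (standard TKGi, BOSH, and kubectl CLIs assumed):

tkgi clusters
bosh vms
bosh tasks
kubectl get nodes
kubectl get pods --all-namespaces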