During a Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster upgrade, the task fails while updating a worker node. The BOSH task log shows the drain step on that worker failing after roughly 44 minutes with a Monit 503 error:
Task 12345 | 12:11:28 | Updating instance worker/##### (3)
Task 12345 | 12:11:43 | L executing pre-stop: worker/##### (3)
Task 12345 | 12:49:51 | L executing drain: worker/##### (3) (00:43:44)
L Error: Action Failed get_task: Task 9999999 result: Unmonitoring services: Unmonitoring service freshclam: Unmonitoring Monit service freshclam: Request failed with 503 Service Unavailable: <html><head><title>503 Service Unavailable</title></head><body bgcolor=#FFFFFF><h2>Service Unavailable</h2>Other action already in progress -- please try again later<p><hr><a href='http://mmonit.com/monit/'><font size=-1>monit 5.2.5</font></a></body></html>
Tanzu Kubernetes Grid Integrated Edition
The Monit daemon on the worker node is busy or in a locked state, so the BOSH agent cannot run the 'unmonitor' command during the drain process. Monit answers with '503 Service Unavailable' and the message "Other action already in progress", and that failure causes the entire upgrade task to fail.
To resolve this issue, you can manually clear the Monit state on the affected worker node and then resume the upgrade.
bosh -d $deployment ssh $worker
monit unmonitor all
tkgi upgrade-cluster $cluster
monit monitor all
monit summary
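Because the 503 means Monit is busy at that moment, the 'monit unmonitor all' step may itself need more than one attempt. The following is a minimal sketch of a retry wrapper, not part of the documented procedure. The function name retry_while_busy and its attempt and delay arguments are hypothetical. It assumes a busy Monit either exits non-zero or prints the "Other action already in progress" text seen in the task log.

```shell
# retry_while_busy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: reruns COMMAND until it exits 0 and its output does not
# contain Monit's "Other action already in progress" message.
retry_while_busy() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Capture stdout and stderr so the busy message can be detected.
    if out=$("$@" 2>&1) && ! printf '%s\n' "$out" | grep -q 'Other action already in progress'; then
      printf '%s\n' "$out"
      return 0
    fi
    echo "attempt $i of $attempts: Monit still busy or command failed" >&2
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example, run on the worker VM (Monit commands there typically require root):
#   retry_while_busy 10 6 monit unmonitor all
```

If the wrapper still fails after all attempts, Monit is likely stuck rather than briefly busy, and the node needs closer inspection before the upgrade is resumed.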