TKGI Cluster Upgrade Fails with Monit 503 Service Unavailable Error during Drain Phase

Article ID: 442464

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

When you upgrade a Tanzu Kubernetes Grid Integrated Edition (TKGI) cluster, the upgrade task fails while updating a worker node. The BOSH task log shows the failure in the drain step on that worker node, where Monit answers the unmonitor request with 503 Service Unavailable:

Task 12345 | 12:11:28 | Updating instance worker/##### (3) 
Task 12345 | 12:11:43 | L executing pre-stop: worker/##### (3) 
Task 12345 | 12:49:51 | L executing drain: worker/##### (3) (00:43:44)
                      L Error: Action Failed get_task: Task 9999999 result: Unmonitoring services: Unmonitoring service freshclam: Unmonitoring Monit service freshclam: Request failed with 503 Service Unavailable: <html><head><title>503 Service Unavailable</title></head><body bgcolor=#FFFFFF><h2>Service Unavailable</h2>Other action already in progress -- please try again later<p><hr><a href='http://mmonit.com/monit/'><font size=-1>monit 5.2.5</font></a></body></html>

Environment

Tanzu Kubernetes Grid Integrated Edition

Cause

The Monit daemon on the worker node is busy with another action that is still in progress, so it rejects the 'unmonitor' request that the BOSH agent sends during the drain process (in the log above, for the freshclam service). Monit answers with '503 Service Unavailable' and the message "Other action already in progress", the BOSH agent reports the drain as failed, and the entire upgrade task fails.
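
To check whether a failed task matches this cause, saved task output can be searched for the Monit signature. The following is a minimal sketch and not part of the official procedure: the function name is_monit_busy is hypothetical, and the two strings it looks for are copied from the error shown above.

```shell
# Hypothetical helper: succeeds only if the text on stdin contains both
# parts of the Monit "busy" signature quoted in this article.
is_monit_busy() {
  input=$(cat)
  printf '%s\n' "$input" | grep -q '503 Service Unavailable' &&
    printf '%s\n' "$input" | grep -q 'Other action already in progress'
}

# Example (12345 is the placeholder task ID used in this article):
#   bosh task 12345 | is_monit_busy && echo "Monit was busy during drain"
```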

Resolution

To resolve this issue, manually unmonitor all Monit services on the affected worker node, re-run the upgrade, and then restore monitoring once the node has been updated.

Step 1: Identify and Access the Affected Node

  1. Review the failed BOSH task output to identify the specific worker node instance that failed (e.g., worker/#####).
  2. SSH into the affected worker node:
    bosh -d $deployment ssh $worker
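
The instance name can also be pulled out of the task output with a small filter. This is a sketch under assumptions: failed_instance is a hypothetical helper, and it assumes the last "executing drain:" line belongs to the instance that failed, which holds for the log above but should be checked against the Error line when several instances drain in parallel.

```shell
# Hypothetical helper: print the instance named on the last
# "executing drain:" line of BOSH task output read from stdin.
failed_instance() {
  sed -n 's/.*executing drain: \([^ ]*\).*/\1/p' | tail -n 1
}

# Example (12345 is the placeholder task ID used in this article):
#   worker=$(bosh task 12345 | failed_instance)
#   bosh -d "$deployment" ssh "$worker"
```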

Step 2: Clear Monit State

  1. Elevate to root: sudo -i
  2. Manually unmonitor all services to bypass the lock:
    monit unmonitor all
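
Because Monit's own message says to "try again later", the manual unmonitor may be rejected too while the other action is still running. A retry wrapper is one way to handle that. This is a minimal sketch: retry_cmd is a hypothetical helper, the attempt count and delay are arbitrary, and it assumes monit exits with a non-zero status when the request is rejected, which should be verified on the node.

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts. Returns 0 on first success,
# 1 if every attempt failed.
retry_cmd() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i of $attempts failed: $*" >&2
    # Do not sleep after the final attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example, run as root on the worker node:
#   retry_cmd 5 10 monit unmonitor all
```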

Step 3: Resume the Cluster Upgrade

  1. From the TKGI CLI, re-run the upgrade command for the cluster:
    tkgi upgrade-cluster $cluster
  2. The upgrade should now resume on the worker node that previously failed.
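
Progress can be followed by reading the cluster details from the TKGI CLI. The sketch below assumes that the output of tkgi cluster contains a line beginning with "Last Action State:"; that field name and the helper name last_action_state are assumptions to check against the CLI version in use.

```shell
# Hypothetical helper: print the value of the "Last Action State:" line
# from cluster details read on stdin (field name is an assumption).
last_action_state() {
  awk -F': *' '/^Last Action State:/ { print $2 }'
}

# Example:
#   tkgi cluster "$cluster" | last_action_state
```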

Step 4: Verify and Restore Monitoring

  1. After the node has successfully updated and reached a running state in BOSH, SSH back into the node and elevate to root (sudo -i).
  2. Ensure monitoring is resumed:
    monit monitor all
  3. Check the status to confirm all processes are correctly monitored:
    monit summary
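
To spot anything still left unmonitored, the summary can be filtered rather than read by eye. This is a sketch under assumptions: unmonitored_services is a hypothetical helper, and it assumes each summary line has the form Process 'name' followed by a status, with unmonitored entries reported as "not monitored".

```shell
# Hypothetical helper: print the names of entries that `monit summary`
# reports as not monitored. Names are taken from between the single quotes.
unmonitored_services() {
  awk -F"'" '/[Nn]ot monitored/ { print $2 }'
}

# Example, run as root on the worker node (no output means nothing is
# reported as unmonitored):
#   monit summary | unmonitored_services
```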