Short downtime during TMC Self-Managed cluster upgrade

Article ID: 427797

Products

VMware Tanzu Platform - Kubernetes
VMware Tanzu Mission Control - SM

Issue/Introduction

When upgrading a vSphere with Tanzu (VKS) workload cluster that hosts Tanzu Mission Control Self-Managed (TMC SM), users may experience a temporary disruption of the TMC UI.

Symptoms: The TMC Console becomes unavailable, often returning a 503 Service Unavailable error or failing to load data.

Duration: The downtime typically lasts between 1 and 3 minutes.

Context: This occurs specifically during the "Rolling Update" phase where worker nodes are drained and replaced.
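
To confirm that the disruption corresponds to pod turnover rather than a cluster fault, you can watch the TMC pods while the upgrade runs. This is a minimal check, assuming TMC Self-Managed is installed in its default tmc-local namespace:

  # Watch TMC pods terminate and restart as worker nodes are drained
  kubectl get pods -n tmc-local --watch

During the drain you should see the stateful pods move through Terminating and return to Running on a replacement node.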

Cause

Although the cluster upgrade is a "rolling" process, TMC Self-Managed relies on StatefulSets (specifically Postgres and Redis) to store critical data.

Unlike stateless web applications, these stateful components have strict architectural requirements that prevent instant failover:

  1. Pod Recreation: When a node is drained, the StatefulSet pods must fully terminate, unmount their storage volumes, and restart on a new node.

  2. Leader Election: If the database runs in High Availability (HA) mode, the system must detect the loss of the primary node and elect a new leader (approx. 30–60 seconds).

  3. Dependency Checks: Dependent services (like the UI and API) will pause or restart while waiting for the database to become "Ready" and for internal controllers (like postgres-endpoint-controller) to update connection secrets.
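
Each of these stages is visible from the cluster while the upgrade is in progress. The commands below are a hedged example, assuming the default tmc-local namespace (StatefulSet names vary by installation):

  # Confirm the stateful components and their ready replica counts
  kubectl get statefulsets -n tmc-local

  # Follow pod rescheduling and leader-election events in time order
  kubectl get events -n tmc-local --sort-by=.lastTimestamp

While the primary database pod is being rescheduled, the READY column of the Postgres StatefulSet drops below its desired count; this is the window in which the UI returns 503 errors.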

Resolution

This behavior is expected by design.

No action is required to fix the cluster; the system will self-heal automatically once the new pods initialize and pass their health checks.
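
To verify that recovery has completed, check that all TMC pods are Running and Ready, again assuming the default tmc-local namespace:

  # All pods should report Running with full readiness (e.g. 1/1)
  kubectl get pods -n tmc-local

  # Optionally confirm a specific StatefulSet has finished rolling out
  # (replace <statefulset-name> with the Postgres or Redis StatefulSet in your installation)
  kubectl rollout status statefulset/<statefulset-name> -n tmc-local

Once the pods pass their health checks, the TMC Console becomes reachable again without any manual intervention.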