When performing an upgrade of a vSphere Kubernetes Service (VKS, formerly vSphere with Tanzu) workload cluster that hosts Tanzu Mission Control Self-Managed (TMC SM), users may experience a temporary disruption of the TMC UI.
Symptoms: The TMC Console becomes unavailable, often returning 503 Service Unavailable or failing to load data.
Duration: The downtime typically lasts between 1 and 3 minutes.
Context: This occurs specifically during the "Rolling Update" phase where worker nodes are drained and replaced.
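To confirm that an observed outage coincides with the rolling update, you can watch node drains and pod evictions as they happen. A minimal sketch, assuming the TMC SM components run in a namespace named `tmc-local` (the default install namespace in many deployments; substitute your own) and that your kubeconfig points at the workload cluster:

```shell
# Watch worker nodes being cordoned (SchedulingDisabled) and replaced
# during the rolling update
kubectl get nodes -w

# In a second terminal, watch TMC SM pods being evicted from drained
# nodes and rescheduled elsewhere (namespace is an assumption)
kubectl get pods -n tmc-local -o wide -w
```

If the 503 errors line up with the interval in which the Postgres or Redis pods show `Terminating` and then `ContainerCreating` on a new node, the downtime matches the expected pattern described here.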
Although the cluster upgrade is a "rolling" process, TMC Self-Managed relies on StatefulSets (specifically Postgres and Redis) to store critical data.
Unlike stateless web applications, these stateful components have strict architectural requirements that prevent instant failover:
Pod Recreation: When a node is drained, the StatefulSet pods must fully terminate, their persistent volumes must be detached and reattached, and the pods must be rescheduled on a new node.
Leader Election: If the database runs in High Availability (HA) mode, the system must detect the loss of the primary node and elect a new leader (approx. 30–60 seconds).
Dependency Checks: Dependent services (like the UI and API) will pause or restart while waiting for the database to become "Ready" and for internal controllers (like postgres-endpoint-controller) to update connection secrets.
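The recovery sequence above can be followed from the command line. A hedged sketch, again assuming the `tmc-local` namespace; the StatefulSet name `postgres` is illustrative, so list the actual StatefulSets first and substitute the real name:

```shell
# List the stateful components and their ready-replica counts
kubectl get statefulsets -n tmc-local

# Follow one StatefulSet until its pods are recreated and Ready
# (statefulset name is an assumption; use a name from the list above)
kubectl rollout status statefulset/postgres -n tmc-local --timeout=5m

# Recent events show volume detach/attach, failed scheduling while the
# node is draining, and readiness transitions during failover
kubectl get events -n tmc-local --sort-by=.lastTimestamp | tail -n 20
```

The `rollout status` command returns once the StatefulSet reports all replicas Ready, which roughly marks the end of the disruption window for that component.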
This behavior is expected by design.
No action is required to fix the cluster; the system will self-heal automatically once the new pods initialize and pass their health checks.
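To verify that the system has self-healed after the upgrade completes, it is enough to confirm that every TMC SM pod is Running and passing its readiness checks. A minimal sketch, assuming the `tmc-local` namespace:

```shell
# Confirm all pods are Running with full READY counts
kubectl get pods -n tmc-local

# Or block until every pod reports the Ready condition
kubectl wait --for=condition=Ready pods --all -n tmc-local --timeout=10m
```

Once `kubectl wait` returns successfully, the TMC Console should load normally again; if pods remain NotReady well beyond the expected 1-3 minute window, the issue is likely something other than the expected rolling-update disruption.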