vSphere with Tanzu Supervisor Upgrade stuck at 0%

Products

VMware vSphere ESXi

VMware vSphere Kubernetes Service

Issue/Introduction

This article provides a workaround for a Supervisor upgrade in vSphere with Tanzu that stalls without deploying any new SupervisorControlPlaneVMs.

Primary Symptom: During a supervisor upgrade in vSphere with Tanzu, the upgrade process is observed to be stuck on the task "Upgrade Namespace Cluster" with a status of "0%". No new SupervisorControlPlaneVMs are deployed.
Secondary Symptom: The wcpsvc.log file contains error and debug messages that point to the underlying issue. Key log entries include:

- Errors related to the retrieval and comparison of WCP versions, such as:

... error wcp [kubelifecycle/compupgrade.go:70] [opID=upgrade-domain-c] vm is not available yet
... error wcp [kubelifecycle/upgrade_controller.go:1484] [opID=upgrade-domain-c#] failed to check if component upgrade is pending: vm is not available yet

- Debug messages indicating issues with the upgrade controller, such as:

... debug wcp [kubelifecycle/upgrade_controller.go:429] [opID=upgrade-domain-c#] error determining current upgrade step: vm is not available yet
... debug wcp [kubelifecycle/upgrade_controller.go:278] [opID=upgrade-domain-c#] upgrade controller retrying

These log entries show where the upgrade process fails and are critical for diagnosing the issue.

Environment

VMware vSphere 7.0 with Tanzu

VMware vSphere 8.0 with Tanzu

Cause

The upgrade stalls because of a bug in the version comparison logic used during Supervisor upgrades. The versions mentioned here are examples; the issue may occur with other versions as well.

In detail:

The current master wcp_version (e.g., 0.0.23-21953###-9) and the desired wcp_version for the upgrade (e.g., 0.0.23-21953###-10) are compared incorrectly.
The version comparison logic in the Go library used by the system compares the versions as ASCII strings, character by character, rather than numerically. Because the character "9" sorts after the character "1", a version ending in "-9" is treated as higher than one ending in "-10".

As a result of this incorrect comparison, the system believes the current version is already up to date or higher than the intended upgrade version, and the upgrade halts at 0%.
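The effect can be reproduced in a few lines. The following is a minimal Python sketch, not the Go code used by wcpsvc; the helper name `generation` is hypothetical, and the version strings are the examples from this article. It contrasts a character-by-character string comparison with a numeric comparison of the trailing generation number.

```python
import re


def generation(wcp_version: str) -> int:
    """Return the trailing generation number of a wcp_version string.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"-(\d+)$", wcp_version)
    if match is None:
        raise ValueError(f"no generation suffix in {wcp_version!r}")
    return int(match.group(1))


current = "0.0.23-21953###-9"
desired = "0.0.23-21953###-10"

# Character-by-character comparison: the strings are identical up to the
# last hyphen, then "9" sorts after "1", so current looks newer.
string_says_current_is_newer = current > desired

# Numeric comparison of the generation suffix gives the expected order.
numeric_says_current_is_newer = generation(current) > generation(desired)

print(string_says_current_is_newer)   # True
print(numeric_says_current_is_newer)  # False
```

This also shows why the workaround sets the generation to 0: "0" sorts before "1" in a string comparison, so the desired version is again seen as newer.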

Resolution

This issue is fixed in 7.0 U3q (build 23788036) and 8.0 U1 (build 21560480).

On affected versions, the problem can be worked around by changing the generation values in the database and the generation number on each Supervisor control plane VM:

Steps:

Cancel Upgrade Task:
- Cancel the in-progress upgrade task so that it is recorded as a failed upgrade due to cancellation.
Verify Upgrade Status:
- Use the dcli command to check the upgrade status. The expected output should indicate an ERROR state. For example:

dcli> com vmware vcenter namespacemanagement software clusters get --cluster domain-c#
upgrade_status:
   desired_version: v1.22.6+vmware.1-vsc0.0.17-57155###
   messages:
      - severity: ERROR
        details: Task for upgrade has been cancelled by user.
   progress:
      total: 100
      completed: 0
      message:
available_versions:
   - v1.22.6+vmware.1-vsc0.0.17-57155###
current_version: v1.21.0+vmware.1-vsc0.0.11-18610###
messages:
state: ERROR
last_upgraded_date: YYYY-MM-DDTHH:MM:SS.000Z
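If the status check needs to be scripted, the top-level fields of this output can be read with a simple text parse. The following is a rough Python sketch that assumes the dcli output has been captured into a string in the layout shown above; the function names `top_level_fields` and `upgrade_was_cancelled` are hypothetical and not part of dcli.

```python
def top_level_fields(dcli_output: str) -> dict:
    """Collect the unindented "key: value" lines of the dcli output."""
    fields = {}
    for line in dcli_output.splitlines():
        # Skip blank lines, indented (nested) lines, and lines without a key.
        if not line or line[0].isspace() or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def upgrade_was_cancelled(dcli_output: str) -> bool:
    """True when the cluster reports the ERROR state expected after cancelling."""
    return top_level_fields(dcli_output).get("state") == "ERROR"


# Sample taken from the example output in this article (trimmed).
sample = """\
upgrade_status:
   desired_version: v1.22.6+vmware.1-vsc0.0.17-57155###
   progress:
      total: 100
      completed: 0
current_version: v1.21.0+vmware.1-vsc0.0.11-18610###
messages:
state: ERROR
"""

print(upgrade_was_cancelled(sample))
```

Only unindented keys are read, so the nested `severity: ERROR` line under `messages` is not mistaken for the cluster state.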

Stop wcpsvc Service:
- Execute the command vmon-cli --stop wcp to stop the wcpsvc service.
Update Database Entries:
- Connect to the database using the command for your vCenter Server version:

vCenter Server 7.0:
# PGPASSFILE=/etc/vmware/wcp/.pgpass /opt/vmware/vpostgres/current/bin/psql -U wcpuser -d VCDB -h localhost

vCenter Server 8.0:
# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

Identify the affected cluster by querying the cluster_db_configs table:

vCenter Server 7.0:
VCDB=> select cluster from cluster_db_configs;

vCenter Server 8.0:
VCDB=> select cluster from wcp.cluster_db_configs;

Update the master_gen and desired_master_gen fields to 0 for the affected cluster:

vCenter Server 7.0:
VCDB=> update cluster_db_configs set desired_master_gen = 0 where cluster = 'domain-c#:<GUID>';
VCDB=> update cluster_db_configs set master_gen = 0 where cluster = 'domain-c#:<GUID>';

vCenter Server 8.0:
VCDB=> update wcp.cluster_db_configs set desired_master_gen = 0 where cluster = 'domain-c#:<GUID>';
VCDB=> update wcp.cluster_db_configs set master_gen = 0 where cluster = 'domain-c#:<GUID>';

Update Generation Number on CP VMs:
- From the UI, locate the IP addresses of each Control Plane VM in the SupervisorControlPlaneVM section.
- Retrieve the root password for the CP VMs using /usr/lib/vmware-wcp/decryptK8Pwd.py.
- SSH into each CP VM and change the generation number (the number after the last hyphen) in the wcp_version field in /etc/vmware/wcp/wcp_versions.yaml to 0. For example:

```
Before: wcp_version: vsc0.0.23-21953###-9
After:  wcp_version: vsc0.0.23-21953###-0
```
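The wcp_versions.yaml edit is a single substitution on the wcp_version line. The following is a minimal Python sketch of that substitution, operating on text only; it does not read or write the file on the CP VM. The function name `reset_generation` and the `other_key` line in the sample are hypothetical, and the sketch assumes the wcp_version value ends in a hyphen followed by digits, as in the example in this article.

```python
import re

# Matches a wcp_version line and captures everything up to the last hyphen,
# so that only the trailing generation number is replaced.
_WCP_VERSION_LINE = re.compile(
    r"^([ \t]*wcp_version:[ \t]*\S*-)\d+([ \t]*)$", re.MULTILINE
)


def reset_generation(yaml_text: str) -> str:
    """Return yaml_text with the wcp_version generation number set to 0."""
    return _WCP_VERSION_LINE.sub(r"\g<1>0\g<2>", yaml_text)


before = "other_key: value\nwcp_version: vsc0.0.23-21953###-9\n"
after = reset_generation(before)
print(after)
```

Other lines pass through unchanged, which makes it easy to compare the text before and after the substitution on each CP VM.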
Start wcpsvc Service:
- Start the wcpsvc service using vmon-cli --start wcp.
Re-trigger Upgrade:
- Once the cluster state is RUNNING, re-initiate the supervisor upgrade process.