This article describes a workaround for a Supervisor upgrade in vSphere with Tanzu that stalls without deploying any new SupervisorControlPlaneVMs.
Primary Symptom: During a Supervisor upgrade in vSphere with Tanzu, the upgrade is stuck on the task "Upgrade Namespace Cluster" at 0%. No new SupervisorControlPlaneVMs are deployed.
Secondary Symptom: The wcpsvc.log file contains error messages that point to the underlying issue. Key log entries include:
... error wcp [kubelifecycle/compupgrade.go:70] [opID=upgrade-domain-c] vm is not available yet
... error wcp [kubelifecycle/upgrade_controller.go:1484] [opID=upgrade-domain-c#] failed to check if component upgrade is pending: vm is not available yet
... debug wcp [kubelifecycle/upgrade_controller.go:429] [opID=upgrade-domain-c#] error determining current upgrade step: vm is not available yet
... debug wcp [kubelifecycle/upgrade_controller.go:278] [opID=upgrade-domain-c#] upgrade controller retrying
These log entries provide key insights into the failures occurring during the upgrade process and are critical for diagnosing the issue.
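To check a copy of wcpsvc.log for these signatures, a small filter along the following lines may help. It is a minimal sketch: the function name `find_stall_entries` is hypothetical, the two match strings are taken from the log excerpts above, and the log file location is not stated here, so the caller supplies the lines.

```python
import re
from typing import Iterable, List

# Signature strings copied from the log excerpts in this article.
STALL_PATTERNS = [
    re.compile(r"vm is not available yet"),
    re.compile(r"upgrade controller retrying"),
]


def find_stall_entries(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the stall signatures."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in STALL_PATTERNS)
    ]
```

It can be fed an open file handle, for example `find_stall_entries(open(path_to_log))`, where `path_to_log` is wherever the log was collected.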
The upgrade stalls because of a bug in the version comparison logic used during Supervisor upgrades in vSphere with Tanzu. The bug surfaces when upgrading between certain versions (the versions mentioned here are examples; the issue may occur with other versions as well).
In detail:
Because of this incorrect version comparison, the upgrade halts at 0%: the system wrongly concludes that the current version is already equal to or higher than the intended upgrade version.
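The faulty comparison itself is not spelled out here, so the following is only a generic illustration of one way a version check can misorder two releases. It is not the wcpsvc logic. The version strings are made up, and `parse_version` is a hypothetical helper.

```python
def parse_version(version: str) -> tuple:
    """Split a dotted version such as '0.0.17' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Made-up values for illustration only.
current, target = "0.0.9", "0.0.17"

# Comparing the raw strings is lexicographic: "9" sorts after "1",
# so the older version is wrongly treated as already up to date.
naive_says_up_to_date = current >= target

# Comparing the parsed integer components orders the versions correctly.
numeric_says_up_to_date = parse_version(current) >= parse_version(target)
```

In this toy case the string comparison reports that no upgrade is needed while the numeric comparison does not, which matches the kind of symptom described above.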
The issue is fixed in 7.0 U3q (build 23788036) and 8.0 U1 (build 21560480).
As a workaround, change the generation values in the database and the generation number on each Supervisor control plane VM:
Steps:
Cancel Upgrade Task:
Verify Upgrade Status:
Use the dcli command to check the upgrade status. The expected output should indicate an ERROR state. For example:
dcli> com vmware vcenter namespacemanagement software clusters get --cluster domain-c#
upgrade_status:
   desired_version: v1.22.6+vmware.1-vsc0.0.17-57155###
   messages:
      - severity: ERROR
        details: Task for upgrade has been cancelled by user.
   progress:
      total: 100
      completed: 0
      message:
available_versions:
   - v1.22.6+vmware.1-vsc0.0.17-57155###
current_version: v1.21.0+vmware.1-vsc0.0.11-18610###
messages:
state: ERROR
last_upgraded_date: YYYY-MM-DDTHH:MM:SS.000Z
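If the dcli output is captured to a file or variable, the check for the expected ERROR state can be scripted. The sketch below assumes the output has the textual shape shown above; the function names `cluster_state` and `upgrade_cancelled` are hypothetical.

```python
from typing import Optional


def cluster_state(dcli_output: str) -> Optional[str]:
    """Return the value of the first 'state:' line, ignoring indentation."""
    for line in dcli_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("state:"):
            return stripped.split(":", 1)[1].strip()
    return None


def upgrade_cancelled(dcli_output: str) -> bool:
    """True when the state is ERROR and the cancellation message is present."""
    return (
        cluster_state(dcli_output) == "ERROR"
        and "cancelled by user" in dcli_output
    )
```

Only proceed with the remaining steps when `upgrade_cancelled` would return True for the captured output.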
Stop wcpsvc Service:
Update Database Entries:
vCenter Server 7.0:
# PGPASSFILE=/etc/vmware/wcp/.pgpass /opt/vmware/vpostgres/current/bin/psql -U wcpuser -d VCDB -h localhost
vCenter Server 8.0:
# /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
List the clusters in the cluster_db_configs table:
vCenter Server 7.0:
VCDB=> select cluster from cluster_db_configs;
vCenter Server 8.0:
VCDB=> select cluster from wcp.cluster_db_configs;
Set the master_gen and desired_master_gen fields to 0 for the affected cluster:
vCenter Server 7.0:
VCDB=> update cluster_db_configs set desired_master_gen = 0 where cluster = 'domain-c#:<GUID>';
VCDB=> update cluster_db_configs set master_gen = 0 where cluster = 'domain-c#:<GUID>';
vCenter Server 8.0:
VCDB=> update wcp.cluster_db_configs set desired_master_gen = 0 where cluster = 'domain-c#:<GUID>';
VCDB=> update wcp.cluster_db_configs set master_gen = 0 where cluster = 'domain-c#:<GUID>';
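To avoid typos when substituting the cluster identifier, the statements can be generated rather than typed by hand. This is a minimal sketch: `reset_generation_sql` is a hypothetical helper, the table and column names are the ones used above, and the trailing SELECT is an added verification query rather than part of the documented procedure.

```python
from typing import List


def reset_generation_sql(cluster_id: str, vcenter_major: int) -> List[str]:
    """Build the UPDATE statements (plus a check query) for one cluster.

    cluster_id is the value returned by the cluster listing query,
    in the form 'domain-c#:<GUID>'.
    """
    # vCenter Server 8.0 qualifies the table with the wcp schema; 7.0 does not.
    table = "wcp.cluster_db_configs" if vcenter_major >= 8 else "cluster_db_configs"
    # Double any single quotes so the literal stays well-formed.
    quoted = "'" + cluster_id.replace("'", "''") + "'"
    return [
        f"update {table} set desired_master_gen = 0 where cluster = {quoted};",
        f"update {table} set master_gen = 0 where cluster = {quoted};",
        # Added check: both fields should read 0 afterwards.
        f"select cluster, master_gen, desired_master_gen from {table} "
        f"where cluster = {quoted};",
    ]
```

The returned strings can be pasted into the psql session opened in the previous step.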
Update Generation Number on CP VMs:
Example of the line to modify in wcp_versions.yaml:
wcp_version: vsc0.0.23-21953###-9
Change this to:
wcp_version: vsc0.0.23-21953###-0
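The edit amounts to replacing the trailing generation number with 0. A minimal sketch of that substitution follows, assuming the line has the `wcp_version: <version>-<generation>` shape shown in the example; `reset_generation` is a hypothetical helper, and the location of wcp_versions.yaml on the control plane VM is not given here.

```python
import re

# Matches 'wcp_version: <anything>-<digits>' and captures everything
# up to and including the final hyphen, plus any trailing whitespace.
_GEN_SUFFIX = re.compile(r"^(\s*wcp_version:\s*\S+-)\d+(\s*)$")


def reset_generation(line: str) -> str:
    """Replace the trailing generation number of a wcp_version line with 0.

    Lines that do not match are returned unchanged.
    """
    return _GEN_SUFFIX.sub(r"\g<1>0\g<2>", line)
```

Applying this to each line of the file and writing the result back changes only the wcp_version entry; back up the file before modifying it.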
Start wcpsvc Service:
Re-trigger Upgrade: