vSphere with Tanzu Supervisor Cluster upgrade to Supervisor version 0.0.15 or later hangs for a few hours at 50%

Products

VMware vCenter Server

Issue/Introduction

Symptoms:
The three new Supervisor Control Plane VMs spin up and most of the pods are running.
The /var/log/vmware/wcp/wcpsvc.log file on vCenter Server shows the following error message:

message : Component TkgUpgrade failed: Failed to run command: ['kubectl', 'apply', '-f', '/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/11-tkgsconfig', '--record'] ret=1 out= err=Flag --record has been deprecated, --record will be removed in the future\nError from server: error when retrieving current configuration of:\nResource: \"run.tanzu.vmware.com/v1alpha2, Resource=tkgserviceconfigurations\", GroupVersionKind: \"run.tanzu.vmware.com/v1alpha2, Kind=TkgServiceConfiguration\"\nName: \"tkg-service-configuration\", Namespace: \"\"\nfrom server for: \"/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/11-tkgsconfig/tkgserviceconfiguration.yaml\": conversion webhook for run.tanzu.vmware.com/v1alpha1, Kind=TkgServiceConfiguration failed: Post \"https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s\": dial tcp 10.1.1.1:443: connect: connection refused\n

The IP address in the error is the ClusterIP of the internal vmware-system-tkg-webhook-service service. The error is not caused by a networking failure. The tkg-webhook pods are simply not up yet, so nothing is listening on that address.

NAMESPACE           NAME                                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
vmware-system-tkg   vmware-system-tkg-webhook-service   ClusterIP   10.1.1.1     <none>        443/TCP   87d
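
One way to confirm that the webhook is simply not ready yet, rather than chasing a network fault, is to ask the webhook deployment for its ready replica count. The following is a minimal sketch, not part of any VMware tooling. It assumes kubectl is usable on the Supervisor Control Plane VM and that the deployment is named vmware-system-tkg-webhook in the vmware-system-tkg namespace (both names are taken from this article). The webhook_ready function name is hypothetical.

```shell
# Hypothetical helper: succeeds once the webhook deployment reports at least one ready replica.
# Assumes kubectl on this VM can reach the Supervisor API server.
webhook_ready() {
    ready=$(kubectl get deployment vmware-system-tkg-webhook \
        -n vmware-system-tkg \
        -o jsonpath='{.status.readyReplicas}' 2>/dev/null)
    # readyReplicas is omitted from the status while no replica is ready, so default to 0.
    [ "${ready:-0}" -ge 1 ] 2>/dev/null
}

# Example use:
#   webhook_ready && echo "webhook is up" || echo "webhook pods are not ready yet"
```

If the helper keeps failing while the rest of the pods are running, that is consistent with the "connection refused" error above.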

On the Supervisor Control Plane VM that is running the component upgrade (the one that has /var/log/vmware/upgrade-ctl-compupgrade.log on it), upgrade-ctl-compupgrade.log shows the upgrade stuck on the same error from the tkg-webhook component.

2022-11-12T16:42:05.466Z ERROR compupgrade: {"error": "Exception", "message": "Failed to run command: ['kubectl', 'apply', '-f', '/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/11-tkgsconfig', '--record'] ret=1 out= err=Flag --record has been deprecated, --record will be removed in the future\nError from server: error when retrieving current configuration of:\nResource: \"run.tanzu.vmware.com/v1alpha2, Resource=tkgserviceconfigurations\", GroupVersionKind: \"run.tanzu.vmware.com/v1alpha2, Kind=TkgServiceConfiguration\"\nName: \"tkg-service-configuration\", Namespace: \"\"\nfrom server for: \"/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/11-tkgsconfig/tkgserviceconfiguration.yaml\": conversion webhook for run.tanzu.vmware.com/v1alpha1, Kind=TkgServiceConfiguration failed: Post \"https://vmware-system-tkg-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s\": dial tcp 10.1.1.1:443: connect: connection refused\n", "backtrace": [" File \"/usr/lib/vmware-wcp/upgrade/compupgrade.py\", line 252, in do\n comp.doUpgrade(upCtx)\n", " File \"/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/10-tkg/gc_component_upgrade.py\", line 82, in doUpgrade\n applyAppConfig(join(TKG_CONFIG, '11-tkgsconfig'))\n", " File \"/usr/lib/vmware-wcp/upgrade/comphelper.py\", line 236, in applyAppConfig\n run(cmd)\n", " File \"/usr/lib/vmware-wcp/upgrade/comphelper.py\", line 71, in run\n raise Exception(exMsg)\n"]}

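NOTE: To check which components are upgraded, upgrading, or failed, run this command on the Supervisor Control Plane VM that is running the component upgrade:

/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort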

In addition, on vSphere with Tanzu deployments that use NSX-T, the ESXi host nodes will not have been upgraded yet.

Environment

VMware vCenter Server 7.0.x

Cause

This is due to a known issue with the upgrade loop.

The issue can occur when upgrading to the Supervisor Cluster build that ships with vCenter Server 7.0 Update 3e or later. Use the table below to correlate the vCenter Server release with the Supervisor Cluster version.

The Supervisor version looks something like "v1.21.0+vmware.wcp.2-vsc0.0.12-18735554".
To correlate this with the table below, the first part of the string (v1.21.0) is the Kubernetes version listed in the fourth column, and the vsc0.0.x part (vsc0.0.12 here) is the Supervisor Cluster version listed in the fifth column.

Version	Release Date	vCenter build	Supported K8 versions	Supervisor Cluster Version
vCenter Server 7.0 Update 3h (7.0.3.01000)	2022-09-13	20395099	1.22 1.21 1.20	0.0.19
vCenter Server 7.0 Update 3g (7.0.3.00800)	2022-07-23	20150588	1.22 1.21 1.20	0.0.17
vCenter Server 7.0 Update 3f (7.0.3.00700)	2022-07-12	20051473	1.22 1.21 1.20	0.0.17
vCenter Server 7.0 Update 3e (7.0.3.00600)	2022-05-12	19717403	1.22 1.21 1.20 ~~1.19.1~~	0.0.15
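
To make the correlation concrete, the sketch below splits a Supervisor version string into the two parts that the table uses. It assumes the string always follows the layout of the sample above (Kubernetes version, then "+", then a "-vsc" marker followed by the Supervisor Cluster version and a build number). The supervisor_version_parts function name is hypothetical.

```shell
# Split a Supervisor version string into its Kubernetes and Supervisor Cluster parts.
# Assumes the layout v<k8s>+vmware.wcp.<n>-vsc<supervisor>-<build>, as in the sample string.
supervisor_version_parts() {
    version=$1
    k8s=${version%%+*}        # text before the first '+', e.g. v1.21.0
    rest=${version#*-vsc}     # text after '-vsc', e.g. 0.0.12-18735554
    supervisor=${rest%%-*}    # drop the trailing build number, leaving 0.0.12
    printf '%s %s\n' "$k8s" "$supervisor"
}

supervisor_version_parts "v1.21.0+vmware.wcp.2-vsc0.0.12-18735554"
# prints: v1.21.0 0.0.12
```

In this example the Supervisor Cluster version is 0.0.12, which is older than every row in the table, so an upgrade from it to any listed release crosses the 0.0.15 threshold named in the title.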

Resolution

This issue is fixed in vCenter Server 8.0 GA.
A fix for 7.0 is still in progress.

The upgrade loop can, however, reconcile on its own after a few hours.

Workaround:

This issue can be worked around. First, validate that this is the EXACT issue you are running into. If you are unsure, contact VMware Support. See the KB article "Troubleshooting Supervisor Control Plane VMs" for instructions on connecting to the Supervisor Control Plane VMs over SSH.

1. Identify which Supervisor Control Plane VM is running the component upgrade. This is the one that has /var/log/vmware/upgrade-ctl-compupgrade.log on it. To find it, list the log directory on each Supervisor Control Plane VM:
ls -l /var/log/vmware

2. From a shell on the Supervisor Control Plane VM identified in step 1, back up the upgrade script:

cp /usr/lib/vmware-wcp/objects/PodVM-GuestCluster/10-tkg/gc_component_upgrade.py /root/gc_component_upgrade.py.backup

3. Open the following file with vi and enable line numbers with :set nu. Make sure you are in command mode, not insert mode, when you type it.

/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/10-tkg/gc_component_upgrade.py

4. Modify line 14

FROM
TKG_DEPLOYMENT_NAMES = ('vmware-system-tkg-controller-manager',)

TO
TKG_DEPLOYMENT_NAMES = ('vmware-system-tkg-controller-manager', 'vmware-system-tkg-webhook',)

5. Remove the following block of code (it should be on lines 94-98; match on the code itself if the line numbers differ).

if not self.tkgServiceConfigurationExists():
   logger.info('Applying default TKGServiceConfiguration')
   applyAppConfig(join(TKG_CONFIG, '11-tkgsconfig'))
else:
   logger.info('Skipping apply of the default TKGServiceConfiguration as it already exists')

6. Wait about 10 minutes for the next upgrade loop, then check the components again to confirm that the TkgUpgrade component is now complete.

root@422e3622efef84f459d1713d7025acef [ ~ ]# /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort
"skipped - AKOUpgrade"
"skipped - HarborUpgrade"
"skipped - LoadBalancerApiUpgrade"
"skipped - TelegrafUpgrade"
"upgraded - AppPlatformOperatorUpgrade"
"upgraded - CapvUpgrade"
"upgraded - CapwUpgrade"
"upgraded - CertManagerUpgrade"
"upgraded - CsiControllerUpgrade"
"upgraded - ImageControllerUpgrade"
"upgraded - KappControllerUpgrade"
"upgraded - LicenseOperatorControllerUpgrade"
"upgraded - NamespaceOperatorControllerUpgrade"
"upgraded - NetOperatorUpgrade"
"upgraded - NSXNCPUpgrade"
"upgraded - PinnipedUpgrade"
"upgraded - PspOperatorUpgrade"
"upgraded - RegistryAgentUpgrade"
"upgraded - SchedextComponentUpgrade"
"upgraded - SphereletComponentUpgrade"
"upgraded - TkgUpgrade" <<<<<<<<<<<<<<< succeeded
"upgraded - TMCUpgrade"
"upgraded - UCSUpgrade"
"upgraded - UtkgClusterMigration"
"upgraded - UtkgControllersUpgrade"
"upgraded - VmOperatorUpgrade"
"upgraded - VMwareSystemLoggingUpgrade"
"upgraded - WCPClusterCapabilities"

Additional Information

Impact/Risks:
This known issue can cause the Supervisor upgrade to take 4 hours or more to complete.
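
Because the workaround is a hand edit of a script and the upgrade can run for hours, two small checks may help. The first reviews the edit against the backup. The second lists only the components that have not finished. The sketch below assumes diff and grep are available on the Supervisor Control Plane VM. The function names show_gc_upgrade_diff and pending_components are hypothetical, and the default paths are the ones used in the workaround steps.

```shell
# Hypothetical helpers; the default paths are the ones used in the workaround steps.

# Print a unified diff between the backup and the edited upgrade script.
show_gc_upgrade_diff() {
    backup=${1:-/root/gc_component_upgrade.py.backup}
    edited=${2:-/usr/lib/vmware-wcp/objects/PodVM-GuestCluster/10-tkg/gc_component_upgrade.py}
    # diff exits with status 1 when the files differ, which is expected after the edit.
    diff -u "$backup" "$edited"
}

# Read "status - component" lines on stdin and print those that are
# neither upgraded nor skipped.
pending_components() {
    grep -Ev '^"(upgraded|skipped) - ' || true
}

# Example use on the VM running the component upgrade:
#   show_gc_upgrade_diff
#   /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status \
#     | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' \
#     | pending_components
```

The diff should show only the changed TKG_DEPLOYMENT_NAMES line and the removed if/else block. Empty output from pending_components means every component is either upgraded or skipped.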