WCP Supervisor Cluster upgrade stuck at 50% due to TkgUpgrade Component failure



Article ID: 323418



Products

VMware vSphere ESXi
VMware vSphere Kubernetes Service

Issue/Introduction

Symptoms: 

  • When upgrading the WCP Supervisor Cluster from vCenter, the upgrade is stuck at 50%, with the vSphere Client showing: "Namespaces cluster upgrade is in upgrade component steps"
  • When reviewing the cluster status from a vCenter SSH session using DCLI, you see messaging similar to:
dcli> namespacemanagement software clusters list

|---------|------------|-----------------------------------|-----------------------------------|-------------------------|-------|------------------------|
|cluster  |cluster_name|desired_version                    |available_versions                 |current_version          |state  |last_upgraded_date      |
|---------|------------|-----------------------------------|-----------------------------------|-------------------------|-------|------------------------|
|domain-c8|            |v1.19.1+vmware.2-vsc0.0.10-18245956|v1.19.1+vmware.2-vsc0.0.10-18245956|v1.18.2-vsc0.0.7-17449972|PENDING|2022-06-06T06:13:51.000Z|
|---------|------------|-----------------------------------|-----------------------------------|-------------------------|-------|------------------------|

dcli> namespacemanagement software clusters get --cluster domain-c8

upgrade_status:
  desired_version: v1.19.1+vmware.2-vsc0.0.10-18245956
  messages:
    - severity: ERROR
      details: A general system error occurred.
  progress:
    total: 100
    completed: 50
    message: Namespaces cluster upgrade is in upgrade components step.
available_versions:
  - v1.19.1+vmware.2-vsc0.0.10-18245956
current_version: v1.18.2-vsc0.0.7-17449972
messages:
  - severity: WARNING
    details: Cluster domain-c8 is on the oldest version.
state: PENDING
last_upgraded_date: 2022-06-06T06:13:51.000Z

  • The /var/log/vmware/wcp/wcpsvc.log on vCenter shows messaging similar to:
2022-06-06T06:13:51.213Z error wcp [kubelifecycle/upgrade_controller.go:312] [opID=upgrade-domain-c8] Failed to upgrade cluster domain-c8, upgrade state: ERROR
........

2022-06-06T08:02:04.795Z error wcp [kubelifecycle/compupgrade.go:235] [opID=upgrade-domain-c8] Component upgrade failed: Component upgrade failed.
 
  • Messaging in the wcpsvc.log on vCenter is followed by errors in the TkgUpgrade component:
2022-06-06T08:05:21.480Z DEBUG comphelper: ret=0 out={"controller_id": "420c1019beddcb6b04247eb123c5689", "state": "error", "messages": [{"level": "error", "message": "Component TkgUpgrade failed: Failed to run command:

 

  • Within the above message stack, you see the specific component failure is:
The validatingwebhookconfigurations \"vmware-system-tkg-validating-webhook-configuration\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\n"}]}

 

  • This failure may also present with the CapwUpgrade component, specifically the capi-kubeadm-control-plane-validating-webhook-configuration webhook.
  • Further logging can be found via SSH on the Supervisor node in the /var/log/vmware/upgrade-ctl-compupgrade.log file.
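The failure signature can be confirmed quickly by searching the component-upgrade log for the webhook validation error. A minimal sketch follows; it writes a fabricated copy of the log so the example is self-contained, whereas on a live Supervisor node you would point grep at /var/log/vmware/upgrade-ctl-compupgrade.log directly:

```shell
# Sketch: locate the webhook failure signature in the component-upgrade log.
# Real path on a Supervisor Control Plane node:
#   /var/log/vmware/upgrade-ctl-compupgrade.log
# The file and sample line below are fabricated for illustration.
LOG=/tmp/upgrade-ctl-compupgrade.log
cat > "$LOG" <<'EOF'
Component TkgUpgrade failed: Failed to run command: The validatingwebhookconfigurations "vmware-system-tkg-validating-webhook-configuration" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
EOF

# The same grep against the real log surfaces the failing component and webhook
grep -n 'must be specified for an update' "$LOG"
```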

Environment

VMware vSphere 7.0 with Tanzu

Cause

This issue is caused by two differing copies of the resourceVersion in the ValidatingWebhookConfiguration: one embedded in the metadata.annotations field and one in the metadata.resourceVersion field.

The mismatch arises when a ValidatingWebhookConfiguration is replaced and the YAML file used to re-apply it still contains a metadata.resourceVersion line. On apply, that stale metadata.resourceVersion no longer matches the resourceVersion recorded in the metadata.annotations field.
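The mechanism can be illustrated with a minimal, self-contained sketch (hypothetical file name and value; this is not the supported fix, which per the Resolution below is performed with Broadcom support): the exported manifest's metadata.resourceVersion line is removed before re-applying, so the API server assigns the current value itself.

```shell
# Illustration of the Cause only (hypothetical file and value).
# A manifest exported from the cluster still carries the object's
# metadata.resourceVersion:
cat > /tmp/webhook-export.yaml <<'EOF'
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: vmware-system-tkg-validating-webhook-configuration
  resourceVersion: "123456"
EOF

# Re-applying the file as-is embeds the stale resourceVersion and causes
# the mismatch described above; stripping the line first avoids it,
# because the API server then assigns the current resourceVersion.
sed -i '/^[[:space:]]*resourceVersion:/d' /tmp/webhook-export.yaml
```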


Resolution

Contact Broadcom support for assistance with cleaning up the webhooks.

 

Additional Information

Impact/Risks:

Supervisor Cluster upgrades will hang until the ValidatingWebhookConfiguration in the Supervisor Cluster is corrected.