Check the Supervisor upgrade status to see if it is stuck at the components upgrade step.
Example:
dcli> namespacemanagement software clusters get --cluster domain-c1006
upgrade_status:
desired_version: v1.20.8+vmware.wcp.1-vsc0.0.17-19939323
messages:
- severity: ERROR
details: A general system error occurred.
progress:
total: 100
completed: 50
message: Namespaces cluster upgrade is in the "upgrade components" step. <- Stuck here
available_versions:
- v1.20.8+vmware.wcp.1-vsc0.0.17-19939323
current_version: v1.19.1+vmware.2-vsc0.0.8-17694864
messages:
Determine which component has issues. Currently two components can have this issue: tkg and capw.
2.1 Follow KB 90194 to ssh into the Supervisor Control Plane VM as root.
2.2 Use upgrade-ctl.py to get the status of the components from the Supervisor Control Plane VM as root:
root@420826d4f6a63a8eafdd88bec59cd6e8 [ ~ ]# /usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort
"failed - CapwUpgrade"
"failed - TkgUpgrade"
"pending - CapvUpgrade"
"pending - NamespaceOperatorControllerUpgrade"
"processing - LicenseOperatorControllerUpgrade"
"skipped - AKOUpgrade"
"skipped - HarborUpgrade"
"skipped - LoadBalancerApiUpgrade"
"skipped - PinnipedUpgrade"
"skipped - TelegrafUpgrade"
"upgraded - AppPlatformOperatorUpgrade"
"upgraded - CertManagerUpgrade"
"upgraded - CsiControllerUpgrade"
"upgraded - ImageControllerUpgrade"
"upgraded - NetOperatorUpgrade"
"upgraded - NSXNCPUpgrade"
"upgraded - PspOperatorUpgrade"
"upgraded - RegistryAgentUpgrade"
"upgraded - SchedextComponentUpgrade"
"upgraded - SphereletComponentUpgrade"
"upgraded - TMCUpgrade"
"upgraded - UCSUpgrade"
"upgraded - VmOperatorUpgrade"
"upgraded - VMwareSystemLoggingUpgrade"
"upgraded - WCPClusterCapabilities"
In the above case, the capw and tkg components have already failed; the license operator is in progress but will ultimately fail because of the garbage collector failure. After fixing capw and tkg, the license operator will recover automatically. If other components have failed, open a case with VMware Support to troubleshoot.
Check if there are orphaned pods and replicasets in the vmware-system-tkg, vmware-system-capw, and vmware-system-license-operator namespaces. Orphaned means the owners of the pods and replicasets are already gone. We can determine this by checking the ownerReferences field in the metadata of the RUNNING pod/replicaset.
These are the replicasets and pods we care about, identified by the following name prefixes:
capw: capi-webhook, capi-kubeadm-bootstrap-webhook, capi-kubeadm-control-plane-webhook
tkg: vmware-system-tkg-webhook
Problematic pods with the above name prefixes are Running but still at the old version.
For problematic replicasets with the above name prefixes, their owner deployments are already gone; see the example commands below.
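As a quick check, the commands below (the object names are placeholders; the capw namespace is used as an example) print the ownerReferences of a suspect pod or replicaset so you can confirm whether its owner deployment is gone:
# Print the ownerReferences of a Running webhook pod (replace the namespace and <pod-name> as appropriate)
kubectl -n vmware-system-capw get pod <pod-name> -o jsonpath='{.metadata.ownerReferences}'
# Print the name of the owner listed on a replicaset, then check whether that deployment still exists
kubectl -n vmware-system-capw get replicaset <replicaset-name> -o jsonpath='{.metadata.ownerReferences[*].name}'
kubectl -n vmware-system-capw get deployment <owner-deployment-name>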
Also check whether there are errors related to garbage-collector sync in the kube-controller-manager log (on the leader instance; commands to locate it follow the snippet below), like:
2022-06-13T05:49:13.267322272Z stderr F I0613 05:49:13.267242 1 shared_informer.go:240] Waiting for caches to sync for garbage collector
2022-06-13T05:49:13.281721999Z stderr F E0613 05:49:13.281667 1 reflector.go:138] k8s.io/client-go/metadata/metadatainformer/informer.go:90: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: conversion webhook for run.tanzu.vmware.com/v1alpha1, Kind=ProviderServiceAccount failed: the server could not find the requested resource
...
2022-06-13T05:50:18.063570304Z stderr F E0613 05:50:18.063472 1 shared_informer.go:243] unable to sync caches for garbage collector
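One way to identify the leader instance and check its log is shown below; the lease name and the kube-controller-manager pod naming are typical for this style of control plane but may differ in your environment:
# Find which control plane node currently holds the kube-controller-manager leader lease
kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
# Grep that instance's log for garbage collector sync errors (replace <leader-node-name>)
kubectl -n kube-system logs kube-controller-manager-<leader-node-name> | grep -i "garbage collector"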
PLEASE NOTE: When on the Supervisor Control Plane VM, you have permissions that can permanently damage the cluster. If VMware Support finds evidence of a customer making changes to the Supervisor cluster from the Supervisor Control Plane VM, they may mark your cluster as unsupported and require you to redeploy the entire vSphere with Tanzu solution. Only use this session to test networks, look at logs, and run kubectl logs/get/describe commands. Do not deploy, delete, or edit anything from this session without express permission from VMware Support or specific instructions from a KB about what exactly you need to deploy/delete/edit.
Deleting a system webhook deployment causes its conversion webhook service to go down. With the conversion webhook service down, the garbage collector stops working, so old webhook pods cannot be garbage collected, new webhook pods cannot come up, and the upgrade gets stuck.
There is currently no resolution. The issue will be fixed in a future release.
To work around the issue, follow the steps below:
Delete orphaned replicasets and pods.
First, delete the replicasets:
kubectl get replicaset -n <namespace> --no-headers=true | awk '/<deployment-name>/{print $1}' | xargs -r kubectl delete -n <namespace> replicaset
Then delete the pods:
kubectl get pod -n <namespace> --no-headers=true | awk '/<deployment-name>/{print $1}' | xargs -r kubectl delete -n <namespace> pod
Replace <namespace> and <deployment-name> in the above commands.
The namespace should be one of vmware-system-capw or vmware-system-tkg.
The component upgrade runs in the order capw -> tkg. If capw already has orphaned pods, then tkg will fail as well. If only tkg has failures, this manual step is not needed for capw.
For the pods and replicasets, delete them using the deployment name as the name prefix. Below are the pod/replicaset names to delete (a substituted example follows the list):
vmware-system-capw: capi-webhook, capi-kubeadm-bootstrap-webhook, capi-kubeadm-control-plane-webhook
vmware-system-tkg: vmware-system-tkg-webhook
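For example, if capw is the failing component, the substituted commands for the capi-webhook prefix would look like the following (repeat for each affected name prefix and namespace listed above):
kubectl get replicaset -n vmware-system-capw --no-headers=true | awk '/capi-webhook/{print $1}' | xargs -r kubectl delete -n vmware-system-capw replicaset
kubectl get pod -n vmware-system-capw --no-headers=true | awk '/capi-webhook/{print $1}' | xargs -r kubectl delete -n vmware-system-capw pod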
After cleaning up all the orphaned replicasets and pods, wait some time for the garbage collector to recover. Watch the kube-controller-manager log to make sure that no errors related to garbage collector sync appear anymore.
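A quick way to verify recovery, using the leader instance identified earlier, is shown below:
# Confirm that new webhook pods have been recreated and are Running
kubectl get pods -n vmware-system-capw
kubectl get pods -n vmware-system-tkg
# Re-check the leader kube-controller-manager log; the garbage collector sync errors should no longer appear
kubectl -n kube-system logs --since=10m kube-controller-manager-<leader-node-name> | grep -i "garbage collector"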
After the Supervisor upgrade has recovered and completed successfully, if Tanzu Kubernetes Clusters are not shown in the vCenter UI, run the following command on the VCSA:
vmon-cli --restart wcp
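To confirm that the wcp service came back up after the restart, the standard VCSA service tooling can be used, for example:
service-control --status wcp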