WCP Supervisor Cluster upgrade to 1.21 or 1.22 hung at 50%

Article ID: 319376

Updated On: 09-29-2022

Products

VMware vSphere ESXi
VMware vSphere with Tanzu

Issue/Introduction

Symptoms:
Upgrading the WCP Supervisor Cluster does not proceed past 50% after new SV nodes are built. This impacts Supervisor Cluster upgrades from older versions to 1.21 or 1.22. The following symptoms are present:
  • The vCenter GUI shows a task indicating that the Namespace Upgrade is in progress.
  • From the vCenter server SSH session, when running DCLI to query the namespace status, users will see:

dcli> namespacemanagement software clusters get --cluster domain-c8
upgrade_status:
   desired_version: v1.22.6+vmware.wcp.1-vsc0.0.17-19939323
   messages:
      - severity: ERROR
        details: A general system error occurred.

   progress:
      total: 100
      completed: 50
      message: Namespaces cluster upgrade is in the "upgrade components" step.

 

  • When connected to the Supervisor Node via SSH, the user will see errors like the following in /var/log/vmware/upgrade-ctl-compupgrade.log:
 
 "CapwUpgrade": {"status": "failed", "messages": [{"level": "error", "message": "Component CapwUpgrade failed: Failed to run command: 

Resource=customresourcedefinitions\", GroupVersionKind: \"apiextensions.k8s.io/v1, Kind=CustomResourceDefinition\"\nName: \"wcpmachinetemplates.infrastructure.cluster.vmware.com\", Namespace: \"\"\nfor: \"wcp-infrastructure.yaml\": CustomResourceDefinition.apiextensions.k8s.io \"wcpmachinetemplates.infrastructure.cluster.vmware.com\" is invalid: status.storedVersions[0]: Invalid value: \"v1alpha2\": must appear in spec.versions\n", "backtrace": ["  File \"/usr/lib/vmware-wcp/upgrade/compupgrade.py\", line 252, in do\n
 

The following message may also appear in the log. It is a symptom of the errors above:
 
2022-09-23T17:16:36.47Z ERROR comphelper: Failed to run command: ['/usr/local/bin/etcdctl_lock', '/vmware/wcp/upgrade/components/lock', '--', '/usr/bin/python3', '/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py', '--logfile', '/var/log/vmware/upgrade-ctl-compupgrade.log', '--statestore', 'EtcdStateStore', 'do-upgrade'] ret=1 out={"error": "OSError", "message": "[Errno 7] Argument list too long: '/usr/local/bin/etcdctl'", "backtrace": ["  File \"/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py\"
 
  • The CAPI-controller-manager, CAPW, and Scheduler pods may be in CrashLoopBackOff state with 1 of 2 containers running:
 
# kubectl get pods -A | grep -v Run
 
NAMESPACE               NAME                                                READY   STATUS             RESTARTS         AGE
kube-system             kube-scheduler-423f01b9b30c727e9c237a00319c15l     1/2     CrashLoopBackOff   5 (99s ago)      57m
svc-tmc-c63             agentupdater-workload-27657688--1-r46p5             0/1     Completed          0                30s
svc-tmc-c63             tmc-agent-installer-27657688--1-wpmxm               0/1     Completed          0                30s
vmware-system-capw      capi-controller-manager-766c6fc449-4qqvf            1/2     CrashLoopBackOff   19 (3m42s ago)   53m
vmware-system-capw      capi-controller-manager-766c6fc449-bcpdq            1/2     CrashLoopBackOff   13 (4m15s ago)   23m
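The log and pod symptoms above can be checked with two small helpers. This is a minimal sketch, not part of the KB tooling: the function names are hypothetical, and it assumes the `kubectl get pods -A` column layout shown above (STATUS in the fourth column).

```shell
# Hypothetical helper: list pods whose STATUS column is CrashLoopBackOff.
# stdin: output of `kubectl get pods -A`
crashlooping_pods() {
    awk '$4 == "CrashLoopBackOff" { print $1 "/" $2 }'
}

# Hypothetical helper: succeed if a log file contains the CRD validation
# error quoted above.
# $1: path to the log file, for example
#     /var/log/vmware/upgrade-ctl-compupgrade.log on the Supervisor node
has_stored_version_error() {
    grep -qF 'must appear in spec.versions' "$1"
}

# Example, run on the Supervisor node:
#   kubectl get pods -A | crashlooping_pods
#   has_stored_version_error /var/log/vmware/upgrade-ctl-compupgrade.log && echo "signature found"
```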
 


 


Environment

VMware vSphere 7.0 with Tanzu

Cause

These symptoms occur due to an upstream Kubernetes issue in which deprecated CRD versions must be removed as part of an upgrade. In this case, the v1alpha2 CRD versions are not correctly removed, which causes a failure when the new CRDs for v1alpha3 and v1beta1 are applied. A CRD version cannot be removed while it is still listed in the CRD's 'status.storedVersions'. Versions are retained in .storedVersions if the .served flag is set to true.
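The rule behind the log error ("status.storedVersions[0]: Invalid value: "v1alpha2": must appear in spec.versions") can be illustrated with a short sketch. It models only the rule quoted in that message, not the API server's implementation, and the function name is hypothetical.

```shell
# Hypothetical helper: print every stored version that is missing from the
# list of versions in the CRD spec. Any output means the CRD update would
# be rejected with "must appear in spec.versions".
# $1: space-separated status.storedVersions
# $2: space-separated names from spec.versions
stored_not_in_spec() {
    stored=$1
    spec=$2
    for v in $stored; do
        case " $spec " in
            *" $v "*) ;;                    # version is still declared in the spec
            *) printf '%s\n' "$v" ;;        # version is stored but no longer declared
        esac
    done
}

# Example: the new manifest declares v1alpha3 and v1beta1 only, while the
# existing CRD still records v1alpha2 as a stored version.
#   stored_not_in_spec "v1alpha2 v1alpha3" "v1alpha3 v1beta1"
```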

This upgrade failure has been narrowed to specific upgrade paths: the WCP Supervisor Cluster was initially built on a vCenter version earlier than or equal to 7.0.0d (7.0.0.10700) Build 16749653, and vCenter was then upgraded to 7.0 U3e (7.0.3.00600) Build 19717403, where CAPI/CAPW `v1alpha2` was removed.

Environments initially installed with WCP Supervisor Clusters on vCenter versions 7.0 U1 (7.0.1.00000) Build 16860138 are not susceptible to this failure.

Resolution

VMware engineering is working to address this issue in future releases of WCP. If you encounter this issue and need to correct it manually, use the workaround below.

Workaround:
NOTE: If you are running this workaround as a proactive fix before a Supervisor Cluster upgrade, skip step 9. Instead, start the WCP Supervisor Cluster upgrade from the vCenter GUI.

1. First, verify that v1alpha2 is a .served version:

 
  • # kubectl get crd -o json machines.cluster.x-k8s.io | jq '.spec.versions[] | "\(.name) \(.served) \(.storage)"'
"v1alpha2 true false"
"v1alpha3 true true"

  • # kubectl get crd -o json machines.cluster.x-k8s.io | jq .status.storedVersions
[ "v1alpha2", "v1alpha3" ]
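For a scripted version of this check, the two outputs above can be piped into small helpers. This is a minimal sketch with hypothetical function names. It assumes the exact output formats shown above, and the same helpers can be reused after the CRDs have been patched.

```shell
# Hypothetical helper: succeed if the given version is served.
# $1: version name, for example v1alpha2
# stdin: lines such as "v1alpha2 true false" (name, served, storage)
is_served() {
    tr -d '"' | awk -v ver="$1" '$1 == ver && $2 == "true" { found = 1 } END { exit !found }'
}

# Hypothetical helper: succeed if the given version is a stored version.
# $1: version name
# stdin: the JSON array printed by `jq .status.storedVersions`
is_stored() {
    grep -q "\"$1\""
}

# Example, using the commands from this step:
#   kubectl get crd -o json machines.cluster.x-k8s.io \
#     | jq '.spec.versions[] | "\(.name) \(.served) \(.storage)"' \
#     | is_served v1alpha2 && echo "v1alpha2 is served"
#   kubectl get crd -o json machines.cluster.x-k8s.io \
#     | jq .status.storedVersions \
#     | is_stored v1alpha2 && echo "v1alpha2 is stored"
```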


2. Download the script attached to this KB and SCP it to /tmp/ on the Supervisor Control Plane Node. Then SSH to that node and extract the script using the following commands:
 
# cd /tmp
# tar -zxf patch-capi-versions-Linux-x86_64.tar.gz


3. Start proxy on port 8080 in order to run the script:
 
kubectl proxy --port=8080 &
Starting to serve on 127.0.0.1:8080

 
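4. Save the PID of the proxy process in a shell variable. Run this immediately after the previous command, in the same shell, because $! refers to the most recent background job: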
 
# proxy_pid=$!


5. Run the script to gather CRD resources presently available:
 
# ./patch-capi-versions-Linux-x86_64
 
clusters.cluster.x-k8s.io                              v1alpha2        storage=false  served=true
clusters.cluster.x-k8s.io                              v1alpha3        storage=false  served=true
clusters.cluster.x-k8s.io                              storedVersions  [v1alpha2 v1alpha3]

The output above shows served=true for v1alpha2, which is what causes the problem. To clear it, run the script again with the -update flag.
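When the script reports many CRDs, a filter can narrow the listing to those that still serve v1alpha2. This is a minimal sketch that assumes the column layout shown in the sample output above. The function name is hypothetical.

```shell
# Hypothetical helper: print the name of each CRD for which v1alpha2 is
# reported with served=true.
# stdin: output of ./patch-capi-versions-Linux-x86_64 (run without flags)
crds_serving_v1alpha2() {
    awk '$2 == "v1alpha2" && /served=true/ { print $1 }'
}

# Example, with the proxy from step 3 still running:
#   ./patch-capi-versions-Linux-x86_64 | crds_serving_v1alpha2
```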


6. Update CRDs:

 
# ./patch-capi-versions-Linux-x86_64 -update


7. Kill the proxy after script completion:

kill $proxy_pid
[1]+ Terminated: 15 kubectl proxy --port=8080
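Steps 3 through 7 follow a common pattern: start a helper in the background, record its PID, run a command, then stop the helper. The wrapper below is a minimal sketch of that pattern and not part of the attached script. `run_with_proxy`, `PROXY_CMD`, and `PROXY_WAIT` are hypothetical names, and the one-second pause is an assumed, crude way of giving the proxy time to start listening.

```shell
# PROXY_CMD and PROXY_WAIT are hypothetical knobs added for this sketch.
PROXY_CMD=${PROXY_CMD:-kubectl proxy --port=8080}
PROXY_WAIT=${PROXY_WAIT:-1}

# Run the given command while the proxy is up, then stop the proxy and
# return the command's exit status.
run_with_proxy() {
    # Left unquoted on purpose so the command string splits into words.
    $PROXY_CMD >/dev/null 2>&1 &
    proxy_pid=$!                            # PID of the background job just started
    sleep "$PROXY_WAIT"                     # crude wait for the proxy to start listening
    rc=0
    "$@" || rc=$?                           # run the wrapped command, keep its status
    kill "$proxy_pid" 2>/dev/null || true   # stop the proxy even if the command failed
    wait "$proxy_pid" 2>/dev/null || true   # reap the background job
    return "$rc"
}

# Example, on the Supervisor Control Plane Node from /tmp:
#   run_with_proxy ./patch-capi-versions-Linux-x86_64 -update
```

Unlike the manual steps, the wrapper stops the proxy even when the wrapped command fails, so port 8080 is not left open.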


8. Confirm v1alpha2 is no longer set to served or stored:
 
  • # kubectl get crd -o json machines.cluster.x-k8s.io | jq '.spec.versions[] | "\(.name) \(.served) \(.storage)"'
"v1alpha2 false false"
"v1alpha3 true true"
  • # kubectl get crd -o json machines.cluster.x-k8s.io | jq .status.storedVersions
[ "v1alpha3" ]
 
9. After confirming that the CRDs were updated successfully, proceed with the WCP upgrade script:
 
NOTE: If you are running this workaround as a proactive fix for your Supervisor Cluster upgrade, please skip step 9 and run the WCP Supervisor Cluster upgrade from the vCenter GUI.
 
# bash /usr/lib/vmware-wcp/objects/PodVM-GuestCluster/20-capw/install.sh
 


The WCP cluster upgrade should then proceed and complete the component upgrades, along with the Spherelet upgrades if the environment is configured with NSX.
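To confirm that the upgrade has moved past 50%, re-run the DCLI query from the Symptoms section and read the progress fields. The helper below is a minimal sketch that parses that output. The function name is hypothetical, and it assumes the field layout shown in the sample output earlier in this article.

```shell
# Hypothetical helper: print "completed/total" from the progress block of
# the `namespacemanagement software clusters get` output.
# stdin: the DCLI output, saved to a file or piped in
upgrade_progress() {
    awk '
        $1 == "progress:"                 { in_progress = 1; next }
        in_progress && $1 == "total:"     { total = $2 }
        in_progress && $1 == "completed:" { completed = $2 }
        END {
            if (total == "" || completed == "") exit 1
            print completed "/" total
        }
    '
}

# Example, with the DCLI output saved to a file:
#   upgrade_progress < /tmp/upgrade-status.txt
```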
 
 


Attachments

patch-capi-versions-Linux-x86_64.tar