Symptoms:
Upgrading the WCP Supervisor Cluster does not proceed past 50% after the new Supervisor (SV) nodes are built. This impacts Supervisor Cluster upgrades from older versions to 1.21 or 1.22. The following symptoms are present:
- The user will see a task in the vCenter GUI indicating that the Namespace Upgrade is in progress.
- From a vCenter Server SSH session, running DCLI to query the namespace status returns output like the following (see the note after the output for locating the cluster ID):
dcli> namespacemanagement software clusters get --cluster domain-c8
upgrade_status:
desired_version: v1.22.6+vmware.wcp.1-vsc0.0.17-19939323
messages:
- severity: ERROR
details: A general system error occurred.
progress:
total: 100
completed: 50
message: Namespaces cluster upgrade is in the "upgrade components" step.
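Note: the cluster ID (domain-c8 in this example) is specific to each environment. Assuming the DCLI shell on the vCenter Server exposes the namespacemanagement clusters list operation (the exact command namespace can differ by vCenter version), it can be used to identify the Supervisor Cluster being upgraded:
dcli> namespacemanagement clusters list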
- When connected to the Supervisor node via SSH, the user will see errors like the following in /var/log/vmware/upgrade-ctl-compupgrade.log:
"CapwUpgrade": {"status": "failed", "messages": [{"level": "error", "message": "Component CapwUpgrade failed: Failed to run command:
Resource=customresourcedefinitions\", GroupVersionKind: \"apiextensions.k8s.io/v1, Kind=CustomResourceDefinition\"\nName: \"wcpmachinetemplates.infrastructure.cluster.vmware.com\", Namespace: \"\"\nfor: \"wcp-infrastructure.yaml\": CustomResourceDefinition.apiextensions.k8s.io \"wcpmachinetemplates.infrastructure.cluster.vmware.com\" is invalid: status.storedVersions[0]: Invalid value: \"v1alpha2\": must appear in spec.versions\n", "backtrace": [" File \"/usr/lib/vmware-wcp/upgrade/compupgrade.py\", line 252, in do\n
The following message may also appear in the log; it is a downstream symptom of the errors above:
2022-09-23T17:16:36.47Z ERROR comphelper: Failed to run command: ['/usr/local/bin/etcdctl_lock', '/vmware/wcp/upgrade/components/lock', '--', '/usr/bin/python3', '/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py', '--logfile', '/var/log/vmware/upgrade-ctl-compupgrade.log', '--statestore', 'EtcdStateStore', 'do-upgrade'] ret=1 out={"error": "OSError", "message": "[Errno 7] Argument list too long: '/usr/local/bin/etcdctl'", "backtrace": [" File \"/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py\"
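The CustomResourceDefinition error above can be confirmed from a Supervisor control plane node, assuming a kubectl session authenticated against the Supervisor Cluster (the CRD name is taken from the error message). The version reported as invalid (v1alpha2) will appear in status.storedVersions but not in spec.versions:
# kubectl get crd wcpmachinetemplates.infrastructure.cluster.vmware.com -o jsonpath='{.status.storedVersions}'
# kubectl get crd wcpmachinetemplates.infrastructure.cluster.vmware.com -o jsonpath='{.spec.versions[*].name}'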
- The capi-controller-manager, CAPW, and kube-scheduler pods may be in a CrashLoopBackOff state with only 1 of 2 containers running:
# kubectl get pods -A | grep -v Run
NAMESPACE            NAME                                              READY   STATUS             RESTARTS         AGE
kube-system          kube-scheduler-423f01b9b30c727e9c237a00319c15l   1/2     CrashLoopBackOff   5 (99s ago)      57m
svc-tmc-c63          agentupdater-workload-27657688--1-r46p5          0/1     Completed          0                30s
svc-tmc-c63          tmc-agent-installer-27657688--1-wpmxm            0/1     Completed          0                30s
vmware-system-capw   capi-controller-manager-766c6fc449-4qqvf         1/2     CrashLoopBackOff   19 (3m42s ago)   53m
vmware-system-capw   capi-controller-manager-766c6fc449-bcpdq         1/2     CrashLoopBackOff   13 (4m15s ago)   23m
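For additional detail on the crashing containers, the pod events and the failing container's previous logs can be inspected, assuming kubectl access to the Supervisor Cluster (the pod name below is from the example output above and will differ per environment):
# kubectl -n vmware-system-capw describe pod capi-controller-manager-766c6fc449-4qqvf
# kubectl -n vmware-system-capw logs capi-controller-manager-766c6fc449-4qqvf --all-containers --previous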