Supervisor Cluster upgrade fails with GUI error: "Configured Supervisor Control Plane VM's Workload Network Configuration error"
On the Supervisor Control Plane VM, you can list the components that failed to upgrade by running the command below:
/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort

"failed - AppPlatformOperatorUpgrade"
"failed - CapwUpgrade"
"failed - ImageControllerUpgrade"
"failed - ImageRegistryUpgrade"
"failed - NamespaceOperatorControllerUpgrade"
"failed - NSXNCPUpgrade"
"failed - RegistryAgentUpgrade"
"failed - VMwareSystemLoggingUpgrade"
"skipped - CertManagerAdditionalUpgrade"
"skipped - HarborUpgrade"
"skipped - LoadBalancerApiUpgrade"
"skipped - TkgUpgrade"
"skipped - VmOperatorUpgrade"
"upgraded - AKOUpgrade"
"upgraded - CapvUpgrade"
"upgraded - CertManagerUpgrade"
"upgraded - CsiControllerUpgrade"
"upgraded - ExternalSnapshotterUpgrade"
"upgraded - KappControllerUpgrade"
"upgraded - LicenseOperatorControllerUpgrade"
"upgraded - NetOperatorUpgrade"
"upgraded - PinnipedUpgrade"
"upgraded - PspOperatorUpgrade"
"upgraded - SchedextComponentUpgrade"
"upgraded - SphereletComponentUpgrade"
"upgraded - TelegrafUpgrade"
"upgraded - TMCUpgrade"
"upgraded - UCSUpgrade"
"upgraded - UtkgClusterMigration"
"upgraded - UtkgControllersUpgrade"
"upgraded - WCPClusterCapabilities"
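To list only the failed components, the same status output can be filtered with jq; this is a sketch that assumes the JSON layout already used by the command above:

/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq -r '.progress | to_entries | .[] | select(.value.status == "failed") | .key'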
Describing the pods that are in CrashLoopBackOff shows output similar to the following:
kubectl describe pod <pod_name> -n <namespace>

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
Ready:          False
Restart Count:  1415
Limits:
  cpu:     250m
  memory:  150Mi
Requests:
  cpu:     100m
  memory:  50Mi
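To locate the affected pods across all namespaces without describing them one by one, the pod list can be filtered (standard kubectl; the exact pod names vary per environment):

kubectl get pods -A | grep CrashLoopBackOff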
This is a known issue; there is currently no resolution.
To work around this issue, increase the pod memory limits by following the steps below:
Note: Based on the environment scale, these memory values might vary.
NamespaceOperatorControllerUpgrade Pod - vmware-system-nsop-controller-manager
kubectl -n vmware-system-nsop patch deployments vmware-system-nsop-controller-manager -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"700Mi"}}}]}}}}'
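To confirm the new limit was applied, the deployment spec can be queried; this assumes the "manager" container patched above is the first container in the pod spec:

kubectl -n vmware-system-nsop get deployment vmware-system-nsop-controller-manager -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'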
Tanzu-auth-controller-manager Pod - vmware-system-tkg-controller

kubectl patch limitrange vmware-system-tkg-default-limit-range -n vmware-system-tkg --type='json' -p='[{"op": "replace", "path": "/spec/limits/0/default/memory", "value": "512Mi"}]'

kubectl delete pod -n vmware-system-tkg vmware-system-tkg-controller-xxxxx
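After the pod is deleted, its replacement should start with the new default limit from the LimitRange; it can be watched coming back up with standard kubectl:

kubectl -n vmware-system-tkg get pods -w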
ImageControllerUpgrade Pod - vmware-system-imageregistry-controller-manager
Pause the pkgi (PackageInstall) object so it stops reconciling the deployment, then increase the memory on the deployment directly.
kubectl -n vmware-system-imageregistry patch pkgi imageregistry-operator -p '{"spec":{"paused":true}}' --type=merge
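To confirm the PackageInstall is paused, check the field set by the patch above (should return "true"):

kubectl -n vmware-system-imageregistry get pkgi imageregistry-operator -o jsonpath='{.spec.paused}'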
Update the deployment:
kubectl -n vmware-system-imageregistry patch deployments vmware-system-imageregistry-controller-manager -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"1500Mi"}}}]}}}}'
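To verify the updated deployment rolled out with the new limit (standard kubectl):

kubectl -n vmware-system-imageregistry rollout status deployment/vmware-system-imageregistry-controller-manager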
AppPlatformOperatorUpgrade Pod - vmware-system-appplatform-operator-mgr (namespace vmware-system-appplatform-operator-system)
Edit the statefulset and increase its resource limits:

kubectl edit statefulset.apps/vmware-system-appplatform-operator-mgr -n vmware-system-appplatform-operator-system

Set the resources to:

resources:
  limits:
    cpu: 800m
    memory: 600Mi

Then force the upgrade from the vCenter GUI by clicking "Resume"; all components that previously failed should now upgrade successfully.
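If a non-interactive change is preferred over kubectl edit for the statefulset above, the same limits can be applied with kubectl patch. This is a sketch: the container name is an assumption, so list the names first and substitute the real one:

# List the container names in the statefulset (the name may not be "manager"):
kubectl -n vmware-system-appplatform-operator-system get statefulset vmware-system-appplatform-operator-mgr -o jsonpath='{.spec.template.spec.containers[*].name}'

# Patch the limits, replacing <container_name> with a name returned above:
kubectl -n vmware-system-appplatform-operator-system patch statefulset vmware-system-appplatform-operator-mgr -p '{"spec":{"template":{"spec":{"containers":[{"name":"<container_name>","resources":{"limits":{"cpu":"800m","memory":"600Mi"}}}]}}}}'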
Once all Supervisor control plane nodes are upgraded, wait for the ESXi host/worker node upgrades to complete.
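Progress can be watched from the Supervisor; each node reports its new version as it is upgraded (standard kubectl):

kubectl get nodes -o wide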
Run the command below to unpause the PackageInstall (pkgi) object:
kubectl -n vmware-system-imageregistry patch pkgi imageregistry-operator -p '{"spec":{"paused":false}}' --type=merge
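After unpausing, kapp-controller reconciles the package again; the DESCRIPTION column should show "Reconcile succeeded" once it completes:

kubectl -n vmware-system-imageregistry get pkgi imageregistry-operator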