Supervisor Cluster upgrade stuck in Configuring state with multiple pods in CrashLoopBackOff due to OOMKilled
Article ID: 372050

Products

VMware vSphere Kubernetes Service

Issue/Introduction

  • Supervisor Cluster upgrade fails with GUI error: "Configured Supervisor Control Plane VM's Workload Network Configuration error"

  • On the Supervisor control plane VM, the failed upgrade components can be listed with the command below:

/usr/lib/vmware-wcp/upgrade/upgrade-ctl.py get-status | jq '.progress | to_entries | .[] | "\(.value.status) - \(.key)"' | sort
"failed - AppPlatformOperatorUpgrade"
"failed - CapwUpgrade"
"failed - ImageControllerUpgrade"
"failed - ImageRegistryUpgrade"
"failed - NamespaceOperatorControllerUpgrade"
"failed - NSXNCPUpgrade"
"failed - RegistryAgentUpgrade"
"failed - VMwareSystemLoggingUpgrade"
"skipped - CertManagerAdditionalUpgrade"
"skipped - HarborUpgrade"
"skipped - LoadBalancerApiUpgrade"
"skipped - TkgUpgrade"
"skipped - VmOperatorUpgrade"
"upgraded - AKOUpgrade"
"upgraded - CapvUpgrade"
"upgraded - CertManagerUpgrade"
"upgraded - CsiControllerUpgrade"
"upgraded - ExternalSnapshotterUpgrade"
"upgraded - KappControllerUpgrade"
"upgraded - LicenseOperatorControllerUpgrade"
"upgraded - NetOperatorUpgrade"
"upgraded - PinnipedUpgrade"
"upgraded - PspOperatorUpgrade"
"upgraded - SchedextComponentUpgrade"
"upgraded - SphereletComponentUpgrade"
"upgraded - TelegrafUpgrade"
"upgraded - TMCUpgrade"
"upgraded - UCSUpgrade"
"upgraded - UtkgClusterMigration"
"upgraded - UtkgControllersUpgrade"
"upgraded - WCPClusterCapabilities"
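The jq filter above turns each entry of the `.progress` object into a "status - component" string and sorts the result. As a rough illustration of the same transformation in Python (the sample JSON below is fabricated, not taken from a real cluster):

```python
import json

# Fabricated sample of the JSON shape emitted by upgrade-ctl.py get-status
sample = json.loads("""
{
  "progress": {
    "NSXNCPUpgrade": {"status": "failed"},
    "CertManagerUpgrade": {"status": "upgraded"},
    "HarborUpgrade": {"status": "skipped"}
  }
}
""")

# Equivalent of: jq '.progress | to_entries | .[] | "\\(.value.status) - \\(.key)"' | sort
lines = sorted(f'"{v["status"]} - {k}"' for k, v in sample["progress"].items())
print("\n".join(lines))
```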

  • Describing a pod that is in CrashLoopBackOff shows output similar to the following

    • Partial output of describing the pod:

      kubectl describe pod <pod_name> -n <namespace>
      State: Waiting
      Reason: CrashLoopBackOff
      Last State: Terminated
      Reason: OOMKilled
      Ready: False
      Restart Count: 1415
      Limits:
      cpu: 250m
      memory: 150Mi
      Requests:
      cpu: 100m
      memory: 50Mi
       
    • The following four pods were in CrashLoopBackOff state due to OOMKilled:
      • vmware-system-nsop-controller-manager
      • vmware-system-tkg-controller
      • vmware-system-imageregistry-controller-manager
      • vmware-system-appplatform-operator-system
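To locate OOMKilled pods across all namespaces, the JSON output of `kubectl get pods -A -o json` can be filtered on each container's last terminated state. A minimal sketch, assuming the standard pod status schema (the sample data below is fabricated for illustration):

```python
def oomkilled_pods(pod_list: dict) -> list:
    """Return (namespace, name) for pods whose last terminated state was OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            term = cs.get("lastState", {}).get("terminated", {})
            if term.get("reason") == "OOMKilled":
                hits.append((pod["metadata"]["namespace"], pod["metadata"]["name"]))
    return hits

# Fabricated sample mimicking `kubectl get pods -A -o json`
sample = {
    "items": [
        {
            "metadata": {"namespace": "vmware-system-nsop",
                         "name": "vmware-system-nsop-controller-manager-abc12"},
            "status": {"containerStatuses": [
                {"lastState": {"terminated": {"reason": "OOMKilled"}}}
            ]},
        },
        {
            "metadata": {"namespace": "kube-system", "name": "coredns-xyz"},
            "status": {"containerStatuses": [{"lastState": {}}]},
        },
    ]
}

print(oomkilled_pods(sample))
```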

Resolution

This is a known issue; there is currently no resolution.

Workaround:

To work around this issue, increase the pod memory limits by following the steps below:

Note: Based on the environment scale, these memory values might vary.

    1. NamespaceOperatorControllerUpgrade POD - vmware-system-nsop-controller-manager pod           

      • kubectl -n vmware-system-nsop patch deployments vmware-system-nsop-controller-manager -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"700Mi"}}}]}}}}'
    2. Tanzu-auth-controller-manager Pod - vmware-system-tkg-controller

      • kubectl patch limitrange vmware-system-tkg-default-limit-range -n vmware-system-tkg --type='json' -p='[{"op": "replace", "path": "/spec/limits/0/default/memory", "value": "512Mi"}]'
      • kubectl delete pod -n vmware-system-tkg vmware-system-tkg-controller-xxxxx
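The `--type='json'` patch in step 2 uses RFC 6902 JSON Patch semantics: the path `/spec/limits/0/default/memory` walks into the first entry of the LimitRange's `limits` array. A simplified sketch of what the "replace" op does (not the real kubectl implementation; the LimitRange fragment is fabricated):

```python
def json_patch_replace(obj: dict, path: str, value):
    """Apply a single RFC 6902 'replace' op in place (simplified: no key escaping)."""
    parts = path.strip("/").split("/")
    target = obj
    for p in parts[:-1]:
        # Numeric path segments index into lists, others into dict keys
        target = target[int(p)] if isinstance(target, list) else target[p]
    last = parts[-1]
    if isinstance(target, list):
        target[int(last)] = value
    else:
        target[last] = value
    return obj

# Fabricated fragment mirroring the vmware-system-tkg-default-limit-range shape
limitrange = {"spec": {"limits": [{"default": {"memory": "150Mi", "cpu": "250m"}}]}}
json_patch_replace(limitrange, "/spec/limits/0/default/memory", "512Mi")
print(limitrange["spec"]["limits"][0]["default"]["memory"])  # -> 512Mi
```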

    3. ImageControllerUpgrade Pod - vmware-system-imageregistry-controller-manager

      • Pause the pkgi (PackageInstall) to stop it from reconciling the deployment, then increase the memory via the deployment.

        • kubectl -n vmware-system-imageregistry patch pkgi imageregistry-operator -p '{"spec":{"paused":true}}' --type=merge

      • Update the deployment  

        • kubectl -n vmware-system-imageregistry patch deployments vmware-system-imageregistry-controller-manager -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"1500Mi"}}}]}}}}'
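The deployment patches in steps 1 and 3 are strategic merge patches: Kubernetes merges the `containers` list by its `name` key instead of replacing the whole list, so only the `memory` limit changes while other fields are preserved. A rough sketch of that merge-by-name behavior, greatly simplified and limited to `resources.limits` (real strategic merge handles many more cases):

```python
def merge_containers(existing: list, patch: list) -> list:
    """Merge container patches into an existing container list, keyed by name (simplified)."""
    by_name = {c["name"]: c for c in existing}
    for p in patch:
        target = by_name.setdefault(p["name"], {"name": p["name"]})
        # Deep-merge only the resources.limits fields from the patch
        limits = target.setdefault("resources", {}).setdefault("limits", {})
        limits.update(p.get("resources", {}).get("limits", {}))
    return list(by_name.values())

# Fabricated container spec using the original limits from the pod description
existing = [{"name": "manager",
             "resources": {"limits": {"cpu": "250m", "memory": "150Mi"}}}]
patch = [{"name": "manager", "resources": {"limits": {"memory": "1500Mi"}}}]

merged = merge_containers(existing, patch)
print(merged[0]["resources"]["limits"])  # cpu limit kept, memory limit raised
```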

    4. AppPlatformOperatorUpgrade Pod - vmware-system-appplatform-operator-system

      • kubectl edit statefulset.apps/vmware-system-appplatform-operator-mgr -n vmware-system-appplatform-operator-system

        • Edit the resources section to:

            resources:
              limits:
                cpu: 800m
                memory: 600Mi

    5. Force the upgrade from the vCenter GUI by clicking "Resume". All components which previously failed should then upgrade successfully.
    6. Once all Supervisor control plane nodes are upgraded, wait for the ESXi host/worker node upgrade to complete.

    7. Run the command below to unpause the PackageInstall (pkgi) object:

      • kubectl -n vmware-system-imageregistry patch pkgi imageregistry-operator -p '{"spec":{"paused":false}}' --type=merge