tkr-source-controller deployment resource crashes due to 'OOMKilled' error

Article ID: 315182

Products

VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Management
VMware Telco Cloud Automation

Issue/Introduction

  • When upgrading a cluster with the Tanzu CLI, the upgrade fails with the following error:
    $ tanzu cluster upgrade gitlab-test --tkr v1.23.16---vmware.1-fips.1-tkg.1 -n default -v 9
    compatibility file (/root/.config/tanzu/tkg/compatibility/tkg-compatibility.yaml) already exists, skipping download
    BOM files inside /root/.config/tanzu/tkg/bom already exists, skipping download
    Using the TKr version 'v1.23.16+vmware.1-fips.1-tkg.1'
    Downloading bom for TKr "v1.23.16---vmware.1-fips.1-tkg.1"
    Error: unable to determine the TKr version and kubernetes version based on '': ConfigMap for TKr name "v1.23.16---vmware.1-fips.1-tkg.1" not available to download bom
  • The tkr-source-controller pod enters a CrashLoopBackOff state, and its container is terminated with the reason "OOMKilled".
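
To confirm that the restarts are caused by memory exhaustion, you can read the last termination reason from the pod status. The following is a minimal sketch, not part of the documented procedure: the helper name last_termination_reason is hypothetical, and it assumes the pod runs in the tkg-system namespace used in the Resolution section.

```shell
# Hypothetical helper (not part of the Tanzu CLI or kubectl).
# Prints the last termination reason of each container in the given pod.
# Usage: last_termination_reason <pod-name>
last_termination_reason() {
  kubectl get pod -n tkg-system "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}
```

Find the pod name with kubectl get pods -n tkg-system, then call last_termination_reason with that name. An output of OOMKilled matches the issue described in this article.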

Environment

VMware Tanzu Kubernetes Grid 2.1.0
VMware Tanzu Kubernetes Grid 2.1.1
VMware Telco Cloud Automation 2.3

Cause

The tkr-source-controller pod can enter a CrashLoopBackOff state when it runs out of memory, at which point its container is terminated with the reason "OOMKilled". In TKG 2.1.x and above, manual increases to the deployment's resources are reverted the next time the package is reconciled. The workaround below increases the deployment's resource requests and limits in a way that persists.

Note: Please refer to the 'Important Notes' at the bottom for information regarding the versions of TKG in which this issue has been resolved.

Resolution

  1. Check the default limits for the tkr-source-controller:

    kubectl describe pod -n tkg-system <tkr-source-controller-manager-pod-name> | grep -E -iB1 'cpu|mem'

    Limits:
      cpu:     100m
      memory:  200Mi
    Requests:
      cpu:        100m
      memory:     100Mi

  2. Pause reconciliation of both "tkg-pkg" and "tkr-source-controller":

    kubectl patch pkgi -n tkg-system tkg-pkg -p '{"spec":{"paused":true}}' --type=merge
    kubectl patch pkgi -n tkg-system tkr-source-controller -p '{"spec":{"paused":true}}' --type=merge
     
  3. Edit the tkr-source-controller-manager deployment to increase its resource requests and limits:

    kubectl edit deployments.apps -n tkg-system tkr-source-controller-manager

  4. Set the CPU and memory requests and limits to the following recommended values (shown in the format of the kubectl describe output from step 1):

    Limits:
      cpu:     500m
      memory:  600Mi
    Requests:
      cpu:        200m
      memory:     200Mi
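
Steps 2 through 4 can also be checked and applied without an interactive editor. The sketch below is one possible approach, not the documented procedure: the function names are hypothetical, and it assumes that the recommended values should apply to every container in the deployment (kubectl set resources updates all containers unless one is selected with -c).

```shell
# Hypothetical helpers; the names are illustrative only.

# Print "<name>=<spec.paused>" for both PackageInstalls so you can
# confirm that step 2 took effect before changing the deployment.
show_paused() {
  for pkgi in tkg-pkg tkr-source-controller; do
    printf '%s=' "$pkgi"
    kubectl get pkgi -n tkg-system "$pkgi" -o jsonpath='{.spec.paused}'
    printf '\n'
  done
}

# Apply the recommended requests and limits, then wait for the rollout.
# Assumption: all containers in the deployment should receive these values.
apply_recommended_resources() {
  kubectl set resources deployment tkr-source-controller-manager -n tkg-system \
    --limits=cpu=500m,memory=600Mi \
    --requests=cpu=200m,memory=200Mi &&
  kubectl rollout status deployment tkr-source-controller-manager -n tkg-system
}
```

Run show_paused first; both lines should end in true. Then run apply_recommended_resources and repeat the check from step 1 to confirm that the new values are in place.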
     

IMPORTANT NOTES:

  • While the packages are paused, manual changes to the management cluster deployments associated with them are not reconciled back to the default state. Because these management components rarely need to change, pausing the packages is safe as long as you make no other manual changes to those deployments.
  • For this workaround to work, the tkg-pkg and tkr-source-controller packages must remain in a paused state. 
  • This is a temporary fix. The patch releases 2.1.2, 2.2.1, and 2.3.1, and all TKG versions from 2.4.0 onward, contain the increased pod spec. Stop using the workaround when you upgrade to a version that contains the fix. When upgrading, follow the steps below:
    • Upgrading to a version of TKG that includes the increased pod spec:
      • Unpause tkg-pkg and tkr-source-controller, then upgrade the management cluster. These versions no longer hit the "OOMKilled" error, so you can stop using this workaround.
    • Upgrading to a version of TKG that does not include the increased pod spec:
      • Unpause tkg-pkg and tkr-source-controller, then upgrade the management cluster. Once the upgrade has completed, the tkr-source-controller pod will again fail with "OOMKilled". Reapply the workaround in this KB to resolve the error. It is highly recommended that you upgrade to a version that includes the fix.
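
Unpausing reverses the patch from step 2 of the Resolution. The following is a minimal sketch, assuming the same PackageInstall names and namespace used there; the helper name set_paused is hypothetical.

```shell
# Hypothetical helper: set spec.paused on both PackageInstalls.
# Usage: set_paused true|false
set_paused() {
  case "$1" in
    true|false) ;;
    *) echo "usage: set_paused true|false" >&2; return 1 ;;
  esac
  for pkgi in tkg-pkg tkr-source-controller; do
    # Same merge patch as in step 2, with the value taken from the argument.
    kubectl patch pkgi -n tkg-system "$pkgi" \
      -p "{\"spec\":{\"paused\":$1}}" --type=merge || return 1
  done
}
```

Run set_paused false before the upgrade. If the target version does not include the fix, run set_paused true again before reapplying the resource changes.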