How to Increase tkr-source-controller deployment resource limits on TKG 2.1.x to resolve 'OOMKilled' Error

Article ID: 315182

Updated On:

Products

VMware Tanzu Kubernetes Grid
VMware Tanzu Kubernetes Grid Management

Issue/Introduction

Symptoms:

  • Upgrading a cluster with the Tanzu CLI can fail with the following error:

    $ tanzu cluster upgrade gitlab-test --tkr v1.23.16---vmware.1-fips.1-tkg.1 -n default -v 9
    compatibility file (/root/.config/tanzu/tkg/compatibility/tkg-compatibility.yaml) already exists, skipping download
    BOM files inside /root/.config/tanzu/tkg/bom already exists, skipping download
    Using the TKr version 'v1.23.16+vmware.1-fips.1-tkg.1'
    Downloading bom for TKr "v1.23.16---vmware.1-fips.1-tkg.1"
    Error: unable to determine the TKr version and kubernetes version based on '': ConfigMap for TKr name "v1.23.16---vmware.1-fips.1-tkg.1" not available to download bom

     

  • The tkr-source-controller pod goes into CrashLoopBackOff, with "OOMKilled" reported as the reason.
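
To confirm the second symptom, the last termination reason of the pod can be read directly. The following is a minimal sketch, assuming the pod name starts with tkr-source-controller-manager (the deployment name used in the workaround); the function name last_termination_reasons is illustrative and not part of any tool.

```shell
#!/bin/sh
NS="tkg-system"
PREFIX="tkr-source-controller-manager"

# Print "<pod-name> <reason>" for each matching pod; <reason> is the reason the
# container was last terminated (empty if it has never been terminated).
last_termination_reasons() {
  kubectl get pods -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
    | grep "^$PREFIX"
}

# Run only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  last_termination_reasons || echo "No $PREFIX pod found in $NS"
fi
```

A reason of OOMKilled next to the pod name matches the symptom described here.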

Environment

VMware Tanzu Kubernetes Grid 2.1.0
VMware Tanzu Kubernetes Grid 2.1.1

Cause

In certain scenarios, the tkr-source-controller pod runs out of memory and goes into a CrashLoopBackOff state. In TKG 2.1.x and later, editing the deployment to raise its resources is not enough on its own, because the deployment is reconciled and the changes are reverted after a few minutes. The workaround in this article pauses reconciliation so that the increased resource settings are not reverted.

Please refer to 'Important Notes' in the Workaround section for information regarding the versions of TKG in which this issue has been resolved.

Resolution

The resource limits on tkr-source-controller are increased in TKG 2.1.2, 2.2.1, 2.3.1, and in 2.4.0 and later. On affected versions, use the workaround in this article.

Workaround:
Check the default limits for the tkr-source-controller pod (replace tkr-source-controller-manager-pod-name with the actual pod name):
kubectl describe pod -n tkg-system tkr-source-controller-manager-pod-name | grep -E -iB1 'cpu|mem'

Limits:
  cpu:     100m
  memory:  200Mi
Requests:
  cpu:        100m
  memory:     100Mi
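
The same values can be read from the deployment, which avoids looking up the generated pod name. This is a sketch, assuming the deployment name tkr-source-controller-manager used in the edit step of this article; show_resources is an illustrative helper name.

```shell
#!/bin/sh
NS="tkg-system"
DEPLOY="tkr-source-controller-manager"

# Print the resources block (limits and requests) of every container in the
# deployment's pod template. Recent kubectl versions print this as JSON.
show_resources() {
  kubectl get deployment "$DEPLOY" -n "$NS" \
    -o jsonpath='{.spec.template.spec.containers[*].resources}'
}

# Run only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  show_resources || echo "Could not read $DEPLOY in $NS"
fi
```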

Pause reconciliation of both "tkg-pkg" and "tkr-source-controller":
kubectl patch pkgi -n tkg-system tkg-pkg -p '{"spec":{"paused":true}}' --type=merge
kubectl patch pkgi -n tkg-system tkr-source-controller -p '{"spec":{"paused":true}}' --type=merge
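
Before editing the deployment, it can help to confirm that both packages actually report as paused. A minimal sketch, assuming the spec.paused field set by the patch commands above; paused_state is an illustrative helper name.

```shell
#!/bin/sh
NS="tkg-system"

# Print the value of spec.paused for one PackageInstall (pkgi); $1 is its name.
paused_state() {
  kubectl get pkgi "$1" -n "$NS" -o jsonpath='{.spec.paused}'
}

# Run only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  for pkg in tkg-pkg tkr-source-controller; do
    echo "$pkg paused: $(paused_state "$pkg" 2>/dev/null)"
  done
fi
```

Both packages should report true before you continue. An empty value suggests the patch was not applied.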

Modify the tkr-source-controller deployment to increase its resource limits and requests:
kubectl edit deployments.apps -n tkg-system tkr-source-controller-manager

After the edit, the new pod should report the following recommended limits and requests (the pod name will differ in your environment):
kubectl describe pod -n tkg-system tkr-source-controller-manager-c8bfc544b-8ctkj | grep -E -iB1 'cpu|mem'

Limits:
  cpu:     500m
  memory:  600Mi
Requests:
  cpu:        200m
  memory:     200Mi
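
As a non-interactive alternative to kubectl edit, the recommended values could be applied with kubectl set resources. This is a sketch rather than the documented procedure: it assumes the values should apply to every container in the deployment, and the APPLY variable and apply_resources function are illustrative names. The packages must already be paused, or the change will be reverted.

```shell
#!/bin/sh
NS="tkg-system"
DEPLOY="tkr-source-controller-manager"
LIMITS="cpu=500m,memory=600Mi"      # recommended limits from this article
REQUESTS="cpu=200m,memory=200Mi"    # recommended requests from this article

# Apply the limits and requests to every container in the deployment.
# Add -c <container-name> to target a single container instead.
apply_resources() {
  kubectl set resources deployment "$DEPLOY" -n "$NS" \
    --limits="$LIMITS" --requests="$REQUESTS"
}

# Nothing is changed unless APPLY=1 is set (a guard used only in this sketch).
if [ "${APPLY:-0}" = "1" ]; then
  apply_resources && kubectl rollout status deployment "$DEPLOY" -n "$NS"
fi
```

Running the script with APPLY=1 changes the deployment and waits for the new pod to roll out.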

After these changes are made, the tkr-source-controller should no longer be in a CrashLoopBackOff state, and the changes made to the resource limits will not be reverted by reconciliation.

IMPORTANT NOTES:
  • Because the packages are paused, any manual change made to the management cluster deployments associated with them will not be reconciled back to the default state. These management components do not require frequent changes, so pausing the packages is safe as long as you do not otherwise modify those deployments.
  • For this workaround to work, the tkg-pkg and tkr-source-controller packages must remain in a paused state. 
  • This is a temporary fix. Patch releases 2.1.2, 2.2.1, and 2.3.1, and all versions of TKG from 2.4.0 onward, contain the increased pod spec. Stop using the workaround when you upgrade to a version that contains the fix. When upgrading, follow the steps below:
    • Upgrading to a version of TKG that includes the increased pod spec:
      • Unpause tkg-pkg and tkr-source-controller and upgrade the management cluster. You will no longer experience the "OOMKilled" error on these versions and you can stop using this workaround.  
    • Upgrading to a version of TKG that does not include the increased pod spec:
      • Unpause tkg-pkg and tkr-source-controller and then upgrade the management cluster. Once the upgrade has completed, the tkr-source-controller will fail with "OOMKilled" again. Repeat the workaround in this KB to resolve the error. It is highly recommended that you upgrade to a version that includes the fix.
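
The unpause step in both upgrade paths mirrors the pause commands from the workaround, with paused set to false. A minimal sketch, assuming the same PackageInstall names; the set_paused function and the UNPAUSE variable are illustrative names.

```shell
#!/bin/sh
NS="tkg-system"

# Set spec.paused on a PackageInstall; $1 is the package name, $2 is true or false.
set_paused() {
  kubectl patch pkgi -n "$NS" "$1" -p "{\"spec\":{\"paused\":$2}}" --type=merge
}

# Nothing is changed unless UNPAUSE=1 is set (a guard used only in this sketch).
if [ "${UNPAUSE:-0}" = "1" ]; then
  set_paused tkg-pkg false
  set_paused tkr-source-controller false
fi
```

Run this before the management cluster upgrade. If the target version does not include the fix, set_paused with true pauses the packages again once the upgrade completes.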