VKS Service Supervisor Tanzu-Auth-Controller Pod in CrashLoopBackOff State due to OOMKilled

Article ID: 401361

Updated On:

Products

VMware vSphere Kubernetes Service

Issue/Introduction

In a vSphere Supervisor environment running a large number of workload clusters, the tanzu-auth system pods in the Supervisor cluster fail and enter the CrashLoopBackOff state.

While connected to the Supervisor cluster context, the following issues are observed:

  • The tanzu-auth-controller system pods associated with the PackageInstall (PKGI) in the VKS Service namespace svc-tkg-domain-c## are stuck in CrashLoopBackOff state:
    • kubectl get pods -A | egrep -v "Run|Complete"
  • Describing the tanzu-auth-controller system pod shows that one or more containers were terminated with reason OOMKilled:
    • kubectl describe pod -n <tanzu-auth-controller namespace> <tanzu-auth-controller pod name>
            finishedAt: "YYYY-MM-DDTHH:MM:SSZ"
            reason: OOMKilled
            startedAt: "YYYY-MM-DDTHH:MM:SSZ"
        name: tanzu-auth-controller-manager
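
The OOMKilled check above can also be scripted. The following is a minimal sketch, assuming kubectl is already pointed at the Supervisor cluster context. The helper name oomkilled_containers is hypothetical and not part of any VMware tooling. It reads each container's last termination reason from the pod status and prints only the containers that were OOMKilled.

```shell
# Hypothetical helper (not part of any product tooling): print the name of every
# container in a pod whose last termination reason was OOMKilled.
# Usage: oomkilled_containers <namespace> <pod name>
oomkilled_containers() {
  # Emit one "<container name><TAB><last termination reason>" line per container.
  # The reason field is empty for containers that have never been terminated.
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' \
    | awk -F'\t' '$2 == "OOMKilled" { print $1 }'
}
```

For example, oomkilled_containers <tanzu-auth-controller namespace> <tanzu-auth-controller pod name> prints nothing when no container in the pod was OOMKilled.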

Environment

vSphere Supervisor 8.0u3 and higher

VKS Service 3.0.0, 3.1.1

Cause

The default memory limits of the affected system pods are too low for the amount of memory that tanzu-auth-controller needs in a large vSphere Supervisor environment.

Manual edits to the pods or the corresponding deployment do not persist because VKS Service system pods are managed by kapp-controller.

Kapp-controller automatically reverts any such changes back to the defaults.
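
To see the limits currently in effect and the objects kapp-controller reconciles, a read-only inspection along these lines may help. This is a sketch under assumptions: the helper names are hypothetical, and the deployment name must be supplied by the operator because this article names only the container (tanzu-auth-controller-manager). Neither command modifies the cluster.

```shell
# Hypothetical helper: print "<container name><TAB><memory limit>" for each
# container defined in a deployment's pod template.
# Usage: container_memory_limits <namespace> <deployment name>
container_memory_limits() {
  kubectl get deployment -n "$1" "$2" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Hypothetical helper: list the PackageInstall (pkgi) objects in a namespace.
# kapp-controller reconciles these, which is why manual edits are reverted.
# Usage: list_package_installs <namespace>
list_package_installs() {
  kubectl get pkgi -n "$1"
}
```

Comparing the reported limit with the memory the container actually uses before it is terminated gives Technical Support a concrete starting point for sizing the new limit.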

Resolution

Please contact VMware by Broadcom Technical Support and reference this KB article for assistance with increasing the default memory limit for tanzu-auth-controller in a way that kapp-controller does not revert.

Additional Information

VKS Service 3.3.2 includes improvements to Supervisor system pod memory usage and memory limits.