In a vSphere Supervisor environment running a large number of workload clusters, the tanzu-auth system pods in the Supervisor cluster are stuck in a CrashLoopBackOff state.
While connected to the Supervisor cluster context, the following issues are observed:
kubectl get pods -A | grep -Ev "Run|Complete"
kubectl describe pod -n <tanzu-auth-controller namespace> <tanzu-auth-controller pod name>
finishedAt: "YYYY-MM-DDTHH:MM:SSZ"
reason: OOMKilled
startedAt: "YYYY-MM-DDTHH:MM:SSZ"
name: tanzu-auth-controller-manager
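To confirm the OOMKilled reason without reading the full pod description, the last termination reason of each container can be queried directly. The following is a minimal sketch, not an official procedure: the function name `last_termination_reasons` is hypothetical, and it assumes `kubectl` is already pointed at the Supervisor cluster context.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Prints "<container name><TAB><last termination reason>" for each container in a pod.
# Usage: last_termination_reasons <namespace> <pod name>
last_termination_reasons() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}
```

Called as `last_termination_reasons <tanzu-auth-controller namespace> <tanzu-auth-controller pod name>`, an affected pod would be expected to list the tanzu-auth-controller-manager container with the reason OOMKilled. Containers that have never been terminated show an empty reason.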
vSphere Supervisor 8.0u3 and higher
VKS Service 3.0.0, 3.1.1
The default memory limits of the affected system pods are too low for the amount of resources they need to manage in a large vSphere Supervisor environment.
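The memory limit currently applied to each container can be read from the pod spec. This is a minimal sketch under the same assumptions as the commands above: the function name `container_memory_limits` is hypothetical, and the namespace and pod name must be supplied from the environment.

```shell
# Hypothetical helper (not part of any VMware tooling).
# Prints "<container name><TAB><memory limit>" for each container in a pod.
# Usage: container_memory_limits <namespace> <pod name>
container_memory_limits() {
  kubectl get pod -n "$1" "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

Called as `container_memory_limits <tanzu-auth-controller namespace> <tanzu-auth-controller pod name>`, this only reads the configured values and changes nothing. A container with no memory limit set shows an empty value.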
Manual edits to the pods or the corresponding deployment do not persist because VKS Service system pods are managed by kapp-controller.
Kapp-controller automatically reverts any such changes to the defaults.
Please contact VMware by Broadcom Technical Support and reference this KB article for assistance with increasing the default memory limits for tanzu-auth-controller without kapp-controller reverting them.
VKS Service 3.3.2 includes improvements to Supervisor system pod memory usage and memory limits.