In a vSphere Supervisor environment running a large number of workload clusters, the imageregistry system pods in the Supervisor cluster fail and enter a CrashLoopBackOff state.
While connected to the Supervisor cluster context, the following issues are observed:
kubectl get pods -A | grep -Ev "Run|Complete"
NAME READY STATUS
vmware-system-imageregistry-controller-manager-<pod id> 0/2 CrashLoopBackOff
kubectl describe pod -n <imageregistry namespace> <image-registry pod name>
finishedAt: "YYYY-MM-DDTHH:MM:SSZ"
reason: OOMKilled
startedAt: "YYYY-MM-DDTHH:MM:SSZ"
name: manager
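As a quicker check than reading the full pod output, the last termination reason and the configured memory limit of each container can be queried directly. The following is a minimal sketch, not an official procedure: the function names `last_termination_reasons` and `container_memory_limits` are hypothetical helpers introduced for illustration, and the namespace and pod name must be taken from the `kubectl get pods -A` output.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not part of any VMware tooling).

# Print each container's name and its last termination reason for one pod.
# Arguments: $1 = namespace, $2 = pod name.
last_termination_reasons() {
    kubectl get pod "$2" -n "$1" \
        -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Print each container's name and its configured memory limit for the same pod.
# Arguments: $1 = namespace, $2 = pod name.
container_memory_limits() {
    kubectl get pod "$2" -n "$1" \
        -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}
```

For example, run `last_termination_reasons <imageregistry namespace> <image-registry pod name>`. A container that lists OOMKilled in the second column matches the symptom described here, and `container_memory_limits` with the same arguments shows the limit that container was running under.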
vSphere Supervisor 8.0u3 and higher
VKS Service 3.0.0 and higher
The default memory limits of the affected system pods are too low for the resource demands of a large vSphere Supervisor environment, so the containers are terminated with OOMKilled.
Manual edits to the pods or their corresponding deployment do not persist. VKS service system pods are managed by kapp-controller, which automatically reverts any such changes to the defaults.
Contact VMware by Broadcom Technical Support and reference this KB article for assistance with increasing the default memory limits of the image-registry system pods in a way that kapp-controller does not revert.
VKS Service 3.3.2 includes improvements to Supervisor system pod memory usage and memory limits.