This article assists users whose large number of TKG clusters is not being realized because the capi-controller-manager pod crashed due to insufficient memory being allocated to it.
If TKG clusters are taking a long time to be realized at scale (for example, hundreds of TKG clusters), the cause may be the capi-controller-manager pod crashing due to insufficient memory. To determine whether this is the case, follow these steps:
1. SSH into the vCenter appliance:
ssh root@<VCSA_IP>
2. Print the credentials used to log in to the Supervisor control plane:
/usr/lib/vmware-wcp/decryptK8Pwd.py
3. SSH into the Supervisor control plane using the IP and credentials from the previous step:
ssh root@<SUPERVISOR_IP>
4. Check whether the capi-controller-manager pod has crashed due to an Out of Memory (OOM) error:
kubectl -n vmware-system-capw \
describe pods -l name=capi-controller-manager | \
grep -F OOMKilled
If OOMKilled appears in the output of the above command, the pod was terminated because it ran out of memory.
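As a supplementary check, the termination reason and restart count can also be read directly from the pod status. The following is a minimal sketch using standard kubectl jsonpath output; it assumes the same namespace and label selector shown in step 4:
# List each matching pod with its last termination reason and restart count
kubectl -n vmware-system-capw get pods -l name=capi-controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
A reason of OOMKilled together with a climbing restart count indicates the container is being repeatedly terminated for exceeding its memory limit.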
The capi-controller-manager pod is allocated a memory limit of 1200Mi (mebibytes). Because of recent changes and an organic increase in the number of resources watched by the controller, more memory is consumed during active reconciliation, and this "burst" requirement exceeds the 1200Mi hard limit.
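To confirm the configured limit and see how close the controller gets to it, the pod's resource limits and current usage can be inspected. This is a sketch, assuming the limits are set on the pod's containers and that metrics are available on the Supervisor for kubectl top; if metrics are not available, the second command will return an error:
# Show the configured memory limit on the capi-controller-manager pod(s)
kubectl -n vmware-system-capw get pods -l name=capi-controller-manager \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources.limits.memory}{"\n"}{end}'
# Show current memory usage (requires metrics to be available)
kubectl -n vmware-system-capw top pod -l name=capi-controller-manager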
Currently there is no resolution.