This article is intended for customers running a large number of TKG clusters who are experiencing delays in realizing those clusters because the kubeadm-control-plane pod is crashing with insufficient memory allocated to it.
Symptoms:
If TKG clusters are taking a long time to be realized at scale (for example, hundreds of TKG clusters), the cause may be the kubeadm-control-plane pod crashing because it has insufficient memory.
To determine if this is the case, please follow these steps:
1. SSH into the vCenter Server appliance:
ssh root@<VCSA_IP>
2. Follow KB 90194 to SSH into the Supervisor Control Plane VM as root.
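For reference, on recent vCenter releases KB 90194 retrieves the Supervisor Control Plane VM credentials with the decryptK8Pwd.py helper on the VCSA; the exact path and steps may vary by release, so defer to the KB itself:
/usr/lib/vmware-wcp/decryptK8Pwd.py
# Note the IP and PWD values printed by the script, then:
ssh root@<SUPERVISOR_CP_IP>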
3. Check whether the kubeadm-control-plane pod has crashed due to an Out of Memory (OOM) error:
kubectl -n vmware-system-capw \
describe pods -l control-plane=controller-manager | \
grep -F OOMKilled
If OOMKilled appears in the output of the above command, the pod was terminated because it did not have sufficient memory.
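As an additional read-only check, you can confirm how often the pod has restarted and why its last container instance terminated. This is a sketch using the same describe output as above; the field names come from standard kubectl describe formatting:
kubectl -n vmware-system-capw describe pods -l control-plane=controller-manager | \
grep -E 'Restart Count|Reason|Exit Code'
A high Restart Count together with Reason: OOMKilled indicates repeated OOM terminations rather than a one-off crash.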
PLEASE NOTE: While on the Supervisor Control Plane VM, you have permissions that can permanently damage the cluster. If VMware Support finds evidence that a customer made changes to the Supervisor cluster from the SV VM, they may mark the cluster as unsupported and require that you redeploy the entire vSphere with Tanzu solution. Use this session only to test networks, look at logs, and run kubectl logs/get/describe commands. Do not deploy, delete, or edit anything from this session without express permission from VMware Support or specific instructions in a KB article about exactly what to deploy, delete, or edit.
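Consistent with the read-only guidance above, you can also inspect the memory currently allocated to the pod. This is a sketch using standard kubectl describe output; the values shown will depend on your release:
kubectl -n vmware-system-capw describe pods -l control-plane=controller-manager | \
grep -A3 -E 'Limits|Requests'
Sharing the reported memory limit along with the number of TKG clusters being reconciled helps VMware Support determine whether the allocation is insufficient at your scale.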