TCA – vmconfig-operator pod keeps crashing due to out of memory

Article ID: 342397

Products

VMware Telco Cloud Automation

Issue/Introduction

This article describes how to increase the memory limit for the vmconfig-operator pod.

Symptoms:
In Telco Cloud Automation (TCA) 2.0.x and 1.9.5, the task to instantiate a network function fails with the following error:

Internal error occurred: failed calling webhook "defaulter.vmconfig.acm.vmware.com": Post "https://vmconfig-webhook-service.tca-system.svc:443/mutate-acm-vmware-com-v1alpha1-nodepolicy?timeout=5s": dial tcp 100.71.33.24:443: connect: connection refused

To confirm the vmconfig-operator is in an out of memory state, log in to the master node of the management cluster and run the following command:

kubectl describe pod -n tca-system -l "control-plane=vmconfig-operator"

In the output, the LastState.Reason of the vmconfig-operator pod is OOMKilled.
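
As an alternative to scanning the full describe output, the termination reason can be read directly with a JSONPath query. The following is a minimal sketch, assuming it is run on the same master node with the same kubectl access; the function name vmconfig_oom_check is hypothetical and not part of TCA.

```shell
# Hypothetical helper (not part of TCA): report whether any
# vmconfig-operator container was last terminated with reason OOMKilled.
vmconfig_oom_check() {
  # Collect the last termination reason of every container in the matching pods.
  reasons=$(kubectl get pod -n tca-system -l "control-plane=vmconfig-operator" \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}')
  case "$reasons" in
    *OOMKilled*) echo "vmconfig-operator was OOMKilled" ;;
    *)           echo "no OOMKilled termination recorded" ;;
  esac
}
```

Calling vmconfig_oom_check with no arguments prints one of the two messages. A pod that has never restarted has no lastState entry, so the second message is printed in that case.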

Environment

VMware Telco Cloud Automation 1.9.5
VMware Telco Cloud Automation 2.0.1
VMware Telco Cloud Automation 2.0

Cause

The vmconfig-operator runs in the management cluster and manages node customization for all nodes in the workload clusters. Each time it reconciles, it loads the machine status for all nodes into memory, and the Go runtime takes a while to reclaim that memory through garbage collection. If the memory used by a pod exceeds its memory limit, Kubernetes terminates the pod.
As a result, Telco Cloud Automation fails to apply the nodepolicy to the management cluster and no network function can be instantiated.
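
To see how the configured limit relates to the crashes described above, the current memory limit and the restart count can be queried together. This is a sketch under assumptions: the deployment name vmconfig-operator in the tca-system namespace matches what the workaround script reports, and the function name vmconfig_memory_report is hypothetical.

```shell
# Hypothetical helper (not part of TCA): print the memory limit configured on
# the vmconfig-operator deployment and the restart count of its pod.
vmconfig_memory_report() {
  # Memory limit of each container in the deployment's pod template.
  limit=$(kubectl get deployment vmconfig-operator -n tca-system \
    -o jsonpath='{.spec.template.spec.containers[*].resources.limits.memory}')
  # Restart count of each container in the running pod.
  restarts=$(kubectl get pod -n tca-system -l "control-plane=vmconfig-operator" \
    -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}')
  echo "memory limit: ${limit:-unknown}"
  echo "restart count: ${restarts:-unknown}"
}
```

If the pod has more than one container, each line lists one value per container, separated by spaces. A restart count that keeps growing while the limit stays unchanged is consistent with the pod being terminated repeatedly.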

Resolution

A fix for this issue will be released with Telco Cloud Automation 2.1.

Workaround:
1. SSH into the TCA-CP appliance and switch to root user.
2. Enlarge the memory for the vmconfig-operator pod by running the following command:

curl -kfsSL 'https://vmwaresaas.jfrog.io/artifactory/generic-registry/kb/20220524/enlarge_vmconfig_mem.sh' | bash
3. The script output indicates whether each management cluster was updated properly. For example, the output for management cluster mc7 looks like this:

current cluter is     cluster: mc7
deployment.apps/vmconfig-operator patched
NAMESPACE    NAME                MemoryLimit
tca-system   vmconfig-operator   2Gi
vmconfig-operator pod is not Running, will recheck after 3 sec
vmconfig-operator pod is not Running, will recheck after 3 sec
vmconfig-operator pod is not Running, will recheck after 3 sec
vmconfig-operator pod is Running
NAME                                 READY   STATUS    RESTARTS   AGE
vmconfig-operator-849dddc645-24qr7   1/1     Running   0          13s
 
Enlarge vmconfig-operator memory finished
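
The result can also be checked independently of the script output. The following is a minimal sketch, assuming kubectl access to the management cluster (for example from its master node); the function name verify_vmconfig_operator and the 120-second timeout are illustrative choices, not values taken from the script.

```shell
# Hypothetical helper (not part of TCA): wait for the patched deployment to
# finish rolling out, then list the vmconfig-operator pod.
verify_vmconfig_operator() {
  # Blocks until the rollout completes or the timeout expires (non-zero exit).
  kubectl rollout status deployment/vmconfig-operator -n tca-system --timeout=120s &&
    kubectl get pod -n tca-system -l "control-plane=vmconfig-operator"
}
```

A successful run ends with the pod listed as Running, matching the last lines of the script output above. A non-zero exit status means the rollout did not complete within the timeout.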