TCA Manager Postgres Pod Restarts Repeatedly with OOMKilled

Article ID: 440014


Updated On:

Products

VMware Telco Cloud Automation
VMware Telco Cloud Platform

Issue/Introduction

The postgres pod in the tca-mgr namespace on the VMware Telco Cloud Automation Manager appliance restarts with an OOMKilled status. Consequently, the tca-app pod may crash due to a backlog of jobs in Kafka that cannot be processed while the database is unavailable.

This condition can be identified by checking the postgres pod status and resource limits. If the memory limit for the pg-container is set to 1050Mi instead of the expected 3500Mi (large) or 3850Mi (xlarge), the system is affected:

kubectl get pods -n tca-mgr | grep postgres
kubectl describe pod postgres-0 -n tca-mgr
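
If only the pg-container memory limit is needed, a jsonpath query can print it directly. This is an optional convenience check; it assumes the container inside the pod is named pg-container, as shown in the pod description above:

kubectl get pod postgres-0 -n tca-mgr -o jsonpath='{.spec.containers[?(@.name=="pg-container")].resources.limits.memory}'

On an affected system, this prints 1050Mi.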

Environment

VMware Telco Cloud Automation 3.4.0

Cause

A defect in the TCA 3.4.0 installer script prevents the tshirtSize parameter from being propagated correctly to the postgres Helm chart during deployment of the TCA Manager appliance. Instead of applying the expected large or xlarge resource limits, the postgres pod falls back to the default medium size (1500Mi memory limit across all containers).

When the system processes a large number of alarms or events, specific PostgreSQL queries consume excessive memory. Because the pod is severely under-provisioned, it quickly exceeds the 1500Mi cgroup limit and is terminated by the Kubernetes Out-Of-Memory (OOM) killer.
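
To confirm which size the running postgres release actually holds, the Helm values computed for the release can be inspected. This is an optional, non-disruptive check; on an affected system, the output does not show tshirtSize set to large or xlarge:

helm get values postgres -n tca-mgr --all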

Resolution

To resolve the issue and stabilize the system, manually upgrade the postgres Helm release to apply the correct tshirtSize. This operation causes Postgres to be temporarily unavailable while the pod is redeployed. The fix persists across system restarts and replicates the intended installer sizing parameters.

Execute the following commands as the admin user on the TCA Manager appliance:

  1. Determine the intended tshirtSize of the TCA Manager appliance (typically large or xlarge) by checking the appliance properties file:

cat /common/configs/appliance.properties | grep resource_size

  2. Upgrade the postgres Helm chart, reusing existing values and explicitly setting the tshirtSize (replace large with xlarge if the appliance is an xlarge deployment):

helm upgrade postgres /opt/vmware/helm_charts/postgres -n tca-mgr --reuse-values --set tshirtSize=large --timeout 20m

  3. Verify that the postgres pod is restarting and applying the new resource limits:

kubectl get pods -n tca-mgr | grep postgres

  4. Once the pod is in the Running state, verify the new limits (the total memory limit across all containers in the pod should now be 5000Mi or 5500Mi, with 3500Mi or 3850Mi for the pg-container):

kubectl describe pod postgres-0 -n tca-mgr | grep -A 5 -i limits
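
As an optional final confirmation, the Helm release values can be re-checked to verify that tshirtSize is now recorded, and the tca-app pod can be monitored until the Kafka backlog is processed and it returns to a stable Running state (the grep pattern below assumes the pod name contains tca-app, as referenced in the Issue section):

helm get values postgres -n tca-mgr
kubectl get pods -n tca-mgr | grep tca-app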

Additional Information

This issue will be resolved in a future patch or release of VMware Telco Cloud Automation.