NSX Application Platform Health - Node memory usage very high alarms seen on NSX-T UI

Article ID: 319827

Updated On:

Products

VMware vDefend Firewall, VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

High memory alarms are raised intermittently on the NSX-T UI for NSX Application Platform worker nodes. See the attached screenshots, high-memory alarm and high-memory-alarm-detail, for examples of the alarms that can be raised.

Environment

VMware NSX-T Data Center 4.x
VMware NSX Application Platform (NAPP) 4.1.x

Cause

Several pods in the nsxi-platform namespace consume large amounts of memory. When these pods are first scheduled, Kubernetes is not aware of how much memory they will eventually consume, which can lead to an unbalanced distribution of memory usage across the worker nodes.

This information can be captured by running the following command within the context of the NAPP cluster:

kubectl top pods -n nsxi-platform --sort-by=memory

Resolution

Run the following commands on the NSX Manager to check memory usage on the cluster. The napp-k command is a kubectl wrapper available on the NSX Manager that runs commands against the NSX Application Platform cluster.

How to check memory usage for worker nodes

napp-k top nodes
napp-k top nodes --sort-by=memory # Sort by memory usage, descending
napp-k top nodes --sort-by=cpu # Sort by CPU usage, descending
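Because the thresholds used later in this article are expressed as percentages, a quick way to spot nodes above 80% memory usage is to filter the MEMORY% column of the output. The following is a minimal sketch only; it assumes the standard kubectl top nodes column order (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

# List worker nodes whose memory usage is above 80%
napp-k top nodes --no-headers | awk '$5+0 > 80 {print $1, $5}'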

How to check memory usage for pods

napp-k top pods
napp-k top pods --sort-by=memory # Sort by memory usage, descending
napp-k top pods --sort-by=cpu # Sort by CPU usage, descending

How to check pods on a specific node

napp-k get pods -o wide | grep <node-name>

How to check memory usage for pods on a specific node

napp-k get pods '-o=custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName,PODIP:.status.podIP,Container_Name:.spec.containers[*].name,CPU_Limit:.spec.containers[*].resources.limits.cpu,CPU_requests:.spec.containers[*].resources.requests.cpu,MEM_Limit:.spec.containers[*].resources.limits.memory,MEM_requests:.spec.containers[*].resources.requests.memory' | grep <node-name>
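As an alternative to grep, and assuming napp-k passes its flags through to kubectl, a field selector can restrict the listing to a single node; the live usage of each of those pods can then be checked individually. A minimal sketch (the node name placeholder is yours to fill in):

# List the pods scheduled on one worker node, then show the live memory
# usage of each (requires the metrics-server in the NAPP cluster).
NODE=<node-name>   # replace with the worker node name
napp-k get pods -o wide --field-selector spec.nodeName="$NODE"
for POD in $(napp-k get pods --field-selector spec.nodeName="$NODE" -o jsonpath='{.items[*].metadata.name}'); do
  napp-k top pods "$POD"
done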

What to do if all worker nodes have high memory usage (at least one worker node >80%, and more than half >70%)

Disable NSX Suspicious Traffic detectors

If NSX Suspicious Traffic detectors are enabled, it is recommended to disable all of them to avoid sudden memory spikes. They can be re-enabled after the system is stable.

Refer to https://docs.vmware.com/en/VMware-NSX-Intelligence/4.1/user-guide/GUID-AA78841C-2F90-4BAF-8905-93BFB7EE6D71.html for instructions on toggling detectors.

Add worker nodes

Contact VMware Tanzu support for assistance with adding more worker nodes to the system.

It is recommended to add 3 more worker nodes if:

2 or more worker nodes are using over 80% memory

It is recommended to add 2 more worker nodes if:

1 worker node is using over 80% memory

Re-balance pods across worker nodes

A Scale-Out is not required and is not recommended after adding the worker nodes. If a Scale-Out is performed, repeat the previous step to add additional worker nodes.

After adding the new worker nodes, they should appear with very little memory usage in the output of the commands above (napp-k top nodes).

Steps for re-balancing the pods
1. Download the memory_usage.sh script (attached to this article) to the NSX Manager and make it executable by running "chmod +x memory_usage.sh".
*NOTE* Do not run this script from the /tmp directory; doing so results in a permission denied error.
2. Run ./memory_usage.sh
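Once the script finishes, the result can be verified with the same commands used earlier in this article, for example:

# Re-check node memory usage and confirm pods are now scheduled on the new nodes
napp-k top nodes --sort-by=memory
napp-k get pods -o wide | grep <new-node-name>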

Modify pod resource request and limit

First, to avoid a potential issue with the default vi editor, run

export KUBE_EDITOR=vim.tiny

Alternatively, you can type the following commands once inside the "napp-k edit" editor:

:set nocompatible
:set backspace=indent,eol,start

Metrics

Due to a known issue, the metrics pods only request the cluster default resources but use more, thereby over-committing resources.

Use the following commands to modify memory requests and limits for these pods.

For deployment pods

napp-k edit deployment <name> #name can be metrics-app-server, metrics-db-helper, metrics-manager, metrics-nsx-config, metrics-postgresql-ha-pgpool, metrics-query-server

For statefulset pods

napp-k edit statefulset <name> #name can be metrics-postgresql-ha-postgresql

When inside the text editor

1. Use "/resources" to find the "resources" field
 
          failureThreshold: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 60
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false
2. Use "i" to enter insert mode
3. Add or modify "requests" and "limits" in the "resources" field
 
        resources:
          limits:
            memory: 5Gi
          requests:
            memory: 1Gi
 
4. After modification, press ESC and then type ":wq" to save, or ESC and then ":q!" to quit without saving

The values to change to are:

Service                    Memory Request    Memory Limit
Metrics PostgreSQL         1Gi               5Gi
Metrics PgPool             1Gi               2Gi
Metrics Manager            1Gi               5Gi
Metrics Query Server       1Gi               2Gi
Metrics DB Helper          1Gi               1Gi
Metrics API Server         2Gi               3Gi
Metrics NSX Config         1Gi               2Gi
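As an alternative to interactive editing, the same values can be applied non-interactively with kubectl set resources, assuming napp-k passes its arguments through to kubectl. The mapping between the service names in the table and the deployment/statefulset names below is inferred from the lists above; adjust the names if they differ in your environment, and add -c <container-name> if a pod template contains more than one container.

# Hedged sketch: apply the memory requests and limits from the table above
napp-k set resources deployment metrics-postgresql-ha-pgpool --requests=memory=1Gi --limits=memory=2Gi
napp-k set resources deployment metrics-manager --requests=memory=1Gi --limits=memory=5Gi
napp-k set resources deployment metrics-query-server --requests=memory=1Gi --limits=memory=2Gi
napp-k set resources deployment metrics-db-helper --requests=memory=1Gi --limits=memory=1Gi
napp-k set resources deployment metrics-app-server --requests=memory=2Gi --limits=memory=3Gi
napp-k set resources deployment metrics-nsx-config --requests=memory=1Gi --limits=memory=2Gi
napp-k set resources statefulset metrics-postgresql-ha-postgresql --requests=memory=1Gi --limits=memory=5Gi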

Druid and Kafka

Use the following commands to modify the resources for memory requests and limits for these pods.

The values below assume that only Metrics and NSX Intelligence are installed. When other applications are installed, more pods will contend for resources, and the values may need to be lowered to make room for them.

If, after updating the resources, any Druid or Kafka pod is stuck in the "Pending" state (because it cannot be scheduled on a worker node), reduce the memory request used by that deployment or statefulset.

To understand the existing memory usage, use the following command to check the memory used by the "druid-broker", "druid-config-broker", "druid-historical", "druid-config-historical", "druid-middle-manager", and "kafka" pods.

napp-k top pods --sort-by=memory | grep <pod-name>

"druid-broker" and "druid-config-broker": If the druid broker pods are using less than 7GB of memory, or the druid config broker pods are using less than 5GB, use the values in the table below from the column titled "Memory (if less)"; otherwise, use the values from the column titled "Memory (if more)".

napp-k edit deployment <name> #name can be druid-broker, druid-config-broker

"druid-historical" and "druid-config-historical": If the druid historical or druid config historical pods are using less than 7GB of memory, use the values from the column titled "Memory (if less)"; otherwise, use the values from the column titled "Memory (if more)".

napp-k edit sts <name> #name can be druid-historical, druid-config-historical

"druid-middle-manager": If the druid middle manager pods are using less than 11GB of memory, use the values from the column titled "Memory (if less)"; otherwise, use the values from the column titled "Memory (if more)".

napp-k edit sts <name> #name can be druid-middle-manager

"kafka": If the kafka pods are using less than 5GB of memory, use the values from the column titled "Memory (if less)"; otherwise, use the values from the column titled "Memory (if more)".

napp-k edit sts <name> #name can be kafka

Service                     Memory (if less)            Memory (if more)
                            Request      Limit          Request      Limit
Druid Broker                8Gi          8Gi            10Gi         10Gi
Druid Config Broker         6Gi          6Gi            8Gi          8Gi
Druid Middle Manager        12Gi         15Gi           15Gi         20Gi
Druid Historical            8Gi          8Gi            10Gi         10Gi
Druid Config Historical     8Gi          8Gi            10Gi         10Gi
Kafka                       as-is        6Gi            6Gi          8Gi
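For example, if the druid middle manager pods were using less than 11GB and the druid broker pods less than 7GB, the "Memory (if less)" values could be applied non-interactively instead of through the editor. This is a sketch only, assuming napp-k passes its arguments through to kubectl; add -c <container-name> if a pod template contains more than one container.

# Check current usage, then apply the "Memory (if less)" values for two of the services
napp-k top pods --sort-by=memory | grep druid-middle-manager
napp-k set resources statefulset druid-middle-manager --requests=memory=12Gi --limits=memory=15Gi
napp-k top pods --sort-by=memory | grep druid-broker
napp-k set resources deployment druid-broker --requests=memory=8Gi --limits=memory=8Gi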

What to do if memory usage is imbalanced (at least one worker node > 80%, while more than half <70%)

Follow the previous section, but skip the 2nd part "Add worker nodes".


Attachments

memory_usage.sh