NSX Application Platform Health Node Memory Usage Very High

Article ID: 313950


Updated On:

Products

VMware vCenter Server

Issue/Introduction

This article describes how to better balance memory usage across the Tanzu cluster running the NSX Application Platform.

Symptoms:
NSX Manager reports an alarm for worker node memory usage over 85%. This may cause instability or temporary outages of the NSX Application Platform.



Running the command kubectl top nodes shows at least one node with memory usage above 85%, as in the example output below:


root [jumpbox]# kubectl top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
napp-advanced-control-plane-12345              220m         11%    3859Mi          49%
napp-advanced-workers-ab123-7857587fb8-abcde   5050m        31%    55423Mi         86%
napp-advanced-workers-ab123-7857587fb8-bcdef   1100m        6%     56416Mi         87%
napp-advanced-workers-ab123-7857587fb8-cdefg   1233m        7%     44107Mi         68%



Cause

Several pods in the nsxi-platform namespace consume large amounts of memory. When these pods were first scheduled, Kubernetes may not have accounted for their eventual memory consumption, which can leave memory unevenly distributed across the worker nodes.

This information can be captured by using the following command from within the context of the napp Tanzu cluster: kubectl top pods -n nsxi-platform --sort-by=memory
 

NAME                                                              CPU(cores)   MEMORY(bytes)
druid-middle-manager-0                                            137m         11639Mi
druid-middle-manager-1                                            347m         10618Mi
druid-middle-manager-2                                            1442m        7621Mi
visualization-6779c4585b-abcde                                    1404m        7119Mi
druid-historical-1                                                50m          6960Mi
druid-historical-0                                                60m          6940Mi
druid-config-historical-0                                         156m         6678Mi
druid-broker-5b6f7fcd4-abcde                                      6m           6465Mi
anomalydetectionstreamingjob-628bfc8a24232f76-exec-1              3m           4657Mi
anomalydetectionstreamingjob-628bfc8a24232f76-exec-2              2m           4612Mi
druid-config-broker-6db5b7759b-abcde                              43m          4330Mi
overflowcorrelator-62b6548a242594b9-exec-2                        43m          3628Mi
overflowcorrelator-62b6548a242594b9-exec-1                        60m          3595Mi
overflowcorrelator-62b6548a242594b9-exec-3                        46m          3563Mi

 
The command kubectl get pods -n nsxi-platform -o wide displays which node each pod is scheduled on. Correlate this with the output of kubectl top nodes to determine which pods to prioritize.
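For example, to list only the pods running on one of the over-utilized nodes, a field selector can be used (the node name below is taken from the example output above; substitute one of your own nodes reporting >85% memory):

kubectl get pods -n nsxi-platform -o wide --field-selector spec.nodeName=napp-advanced-workers-ab123-7857587fb8-abcde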

Resolution

To resolve this, deploy additional worker nodes so that pods have enough headroom to be distributed evenly. Using the "Scale Out Worker Node" documentation, provision new worker nodes and join them to the cluster. Once the cluster has returned to a ready state, begin deleting the high-memory pods identified in the "Cause" section of this article.


kubectl delete pod -n nsxi-platform <pod-name>


The pod should be rescheduled onto one of the newly created nodes. Repeat this process until the output of kubectl top nodes shows more even memory usage across the nodes.
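As a sketch, assuming druid-middle-manager-0 from the example output above was identified as one of the high-memory pods, the sequence looks like this:

# druid-middle-manager-0 is an example taken from the output above; substitute the pod you identified
kubectl delete pod -n nsxi-platform druid-middle-manager-0
# confirm the replacement pod landed on one of the newly added worker nodes
kubectl get pods -n nsxi-platform -o wide | grep druid-middle-manager-0
# recheck per-node memory usage
kubectl top nodes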

The alarm on NSX-T should automatically go into the "Resolved" state once memory usage drops back below the threshold.

Additional Information

Impact/Risks:

NSX Application Platform may show degraded status or may be temporarily offline. In some cases, there is no outage, but the memory usage is still high enough to trigger the alarm (>85% usage per node).

Note: If memory usage is unevenly distributed across the worker nodes, you can also simply delete the over-utilized pods after identifying them using the steps below.

1. Identify Worker Nodes Under Memory Pressure

Use the following command to list all nodes sorted by memory usage and identify the nodes consuming the most memory.

napp-k top nodes --sort-by=memory

Then describe the node suspected to be under pressure:

napp-k describe node <node-name>
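The memory-related details appear under the "Conditions" and "Allocated resources" sections of the describe output. As a quick filter, assuming a standard shell on the jumpbox:

napp-k describe node <node-name> | grep -E -A 6 "Conditions:|Allocated resources:"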
2. List All Pods Running on the Affected Node
napp-k get pods -A -o wide | grep <node-name>

This will list all pods scheduled on that node with their namespaces.
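Alternatively, a field selector lists only the pods on that node without piping through grep (node name is a placeholder):

napp-k get pods -A -o wide --field-selector spec.nodeName=<node-name>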

3. Sort Pods by Memory Usage

Use this command to find the top memory-consuming pods:

napp-k top pod -A --sort-by=memory

Cross-reference with the list from Step 2 to identify the high memory pods on the specific node.
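As an optional shortcut, assuming a bash shell on the jumpbox, the two lists can be combined so that only the high-memory pods running on the affected node are shown (node name is a placeholder):

# show memory usage only for pods scheduled on the affected node
napp-k top pod -A --sort-by=memory --no-headers | \
  grep -F -f <(napp-k get pods -A -o wide --field-selector spec.nodeName=<node-name> --no-headers | awk '{print $2}')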

4. Delete High Memory Pods

Delete the pod(s). Kubernetes will automatically recreate them, and the scheduler will typically place them on a node with more available memory.

napp-k delete pod <pod-name> -n <namespace>

Repeat for all necessary pods contributing to the memory pressure.
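After each deletion, it can help to confirm where the replacement pod was scheduled before moving on to the next one (pod name and namespace are placeholders; Deployment-managed pods come back with a new name suffix):

napp-k get pods -n <namespace> -o wide | grep <pod-name>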

5. Confirm Node Condition Has Recovered

Recheck the node condition:

napp-k top nodes --sort-by=memory
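As an additional (optional) check, confirm that the MemoryPressure condition reports False on each node:

napp-k describe nodes | grep -E "^Name:|MemoryPressure"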

The alarm on the NSX Manager should automatically go into the "Resolved" state.