NSX Application Platform Health Node Memory Usage Very High

Article ID: 313950


Updated On:

Products

VMware vCenter Server

Issue/Introduction

This article describes how to better balance memory usage across the Tanzu cluster running NSX Application Platform.

Symptoms:
NSX Manager raises an alarm when worker node memory usage exceeds 85%. This can cause instability or temporary outages of the NSX Application Platform.


Running the command kubectl top nodes shows at least one node with memory usage above 85%, as in the following output:

root [jumpbox]# kubectl top nodes
NAME                                           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
napp-advanced-control-plane-12345              220m         11%    3859Mi          49%
napp-advanced-workers-ab123-7857587fb8-abcde   5050m        31%    55423Mi         86%
napp-advanced-workers-ab123-7857587fb8-bcdef   1100m        6%     56416Mi         87%
napp-advanced-workers-ab123-7857587fb8-cdefg   1233m        7%     44107Mi         68%
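On clusters with many worker nodes, a quick scripted check can flag only the nodes above the alarm threshold. A minimal sketch, assuming the default kubectl top nodes column layout shown above (MEMORY% in the fifth column):

# Print only nodes whose reported MEMORY% exceeds 85.
# Assumes the default column order: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
kubectl top nodes --no-headers | awk '{gsub("%","",$5); if ($5+0 > 85) print $1, $5"%"}'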


Cause

Several pods in the nsxi-platform namespace consume large amounts of memory. When these pods were first scheduled, Kubernetes may not have been aware of how much memory they would eventually consume, leading to an uneven distribution of memory usage across the worker nodes.

This information can be captured by running the following command from within the context of the napp Tanzu cluster: kubectl top pods -n nsxi-platform --sort-by=memory
 
NAME                                                              CPU(cores)   MEMORY(bytes)
druid-middle-manager-0                                            137m         11639Mi
druid-middle-manager-1                                            347m         10618Mi
druid-middle-manager-2                                            1442m        7621Mi
visualization-6779c4585b-abcde                                    1404m        7119Mi
druid-historical-1                                                50m          6960Mi
druid-historical-0                                                60m          6940Mi
druid-config-historical-0                                         156m         6678Mi
druid-broker-5b6f7fcd4-abcde                                      6m           6465Mi
anomalydetectionstreamingjob-628bfc8a24232f76-exec-1              3m           4657Mi
anomalydetectionstreamingjob-628bfc8a24232f76-exec-2              2m           4612Mi
druid-config-broker-6db5b7759b-abcde                              43m          4330Mi
overflowcorrelator-62b6548a242594b9-exec-2                        43m          3628Mi
overflowcorrelator-62b6548a242594b9-exec-1                        60m          3595Mi
overflowcorrelator-62b6548a242594b9-exec-3                        46m          3563Mi
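To see why the scheduler placed a heavy pod where it did, a pod's declared memory request can be compared with its actual usage above. A minimal sketch, using druid-middle-manager-0 from the example output (any pod name can be substituted):

# Print each container's declared memory request; if it is far below the actual
# usage reported by `kubectl top pods`, the scheduler underestimated the pod.
# Containers with no request set will print an empty value.
kubectl get pod -n nsxi-platform druid-middle-manager-0 \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.requests.memory}{"\n"}{end}'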

 
The command kubectl get pods -n nsxi-platform -o wide displays which node each pod is scheduled on. Correlate this with the output of kubectl top nodes to decide which pods to prioritize.
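For example, to list only the pods running on one of the nodes reporting high memory, the node name from kubectl top nodes can be passed as a field selector (the node name below is taken from the example output above):

# List pods in nsxi-platform scheduled on a specific high-memory node.
kubectl get pods -n nsxi-platform -o wide \
  --field-selector spec.nodeName=napp-advanced-workers-ab123-7857587fb8-bcdef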

Resolution

To resolve this, deploy new worker nodes so that there is more overhead and pods can be distributed more evenly. Following the "Scale Out Worker Node" section of the linked documentation page, provision new worker nodes and join them to the cluster. Once the cluster has returned to Ready status, begin deleting the high-memory pods identified in the "Cause" section of this article.
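Before deleting any pods, confirm that the new worker nodes have joined the cluster and report Ready, for example:

# All nodes, including the newly provisioned workers, should show STATUS Ready.
kubectl get nodes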

kubectl delete pod -n nsxi-platform <pod-name>

Each deleted pod should be rescheduled onto one of the newly created nodes. Repeat this process until the output of kubectl top nodes shows more consistent memory usage across the nodes.
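A typical verification after each deletion, using druid-middle-manager from the example output above, is to check where the replacement pod landed and then re-check the per-node memory balance:

# Confirm the replacement pod was scheduled onto one of the new worker nodes.
kubectl get pods -n nsxi-platform -o wide | grep druid-middle-manager

# Re-check per-node memory usage.
kubectl top nodes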

The alarm on NSX-T should automatically move to the "Resolved" state once memory usage drops below the threshold.

Additional Information

Impact/Risks:
NSX Application Platform may show a degraded status or may be temporarily offline. In some cases there is no outage, but memory usage is still high enough to trigger the alarm (>85% usage on a node).