Purpose: To better balance memory usage across the Tanzu Kubernetes cluster running NSX Application Platform.
Symptoms:
NSX Manager reports an alarm for worker node memory usage over 85%. This may cause instability or temporary outages of NSX Application Platform.
Running kubectl top nodes shows at least one node with memory usage above 85%, similar to the example output below:
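(Node names and values here are illustrative only; actual output will differ.)
NAME              CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
<worker-node-1>   1850m        23%    27850Mi         87%
<worker-node-2>   1610m        20%    19970Mi         62%
<worker-node-3>   1720m        21%    20480Mi         64%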
Cause:
Several pods in the nsxi-platform namespace consume large amounts of memory. At the time a pod is first scheduled, Kubernetes may not anticipate this consumption, which can lead to an uneven distribution of memory usage across the worker nodes.
This information can be captured with the following command from within the context of the NAPP Tanzu cluster:
kubectl top pods -n nsxi-platform --sort-by=memory
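Example output (pod names and values are illustrative; the pods at the top of the list are the candidates to redistribute):
NAME                  CPU(cores)   MEMORY(bytes)
<high-memory-pod-1>   420m         9216Mi
<high-memory-pod-2>   310m         7424Mi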
Resolution:
To resolve this, deploy additional worker nodes to provide more headroom so pods can be distributed more evenly. Following the "Scale Out Worker Node" section, provision new worker nodes and join them to the cluster. Once the cluster has returned to Ready status, begin deleting the high-memory pods identified in the "Cause" section of this article.
Each deleted pod should be rescheduled onto one of the newly created nodes. Repeat this process until the output of kubectl top nodes shows more even memory usage across the nodes.
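To confirm that a recreated pod landed on one of the new nodes, check its node assignment (the pod name below is a placeholder):
kubectl get pods -n nsxi-platform -o wide | grep <pod-name>
The NODE column shows which worker node the pod is now running on.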
The alarm on NSX-T should automatically go into the "Resolved" state once memory usage drops back below the threshold.
Impact/Risks:
NSX Application Platform may show degraded status or may be temporarily offline. In some cases, there is no outage, but the memory usage is still high enough to trigger the alarm (>85% usage per node).
Note: If memory usage is not evenly distributed across the worker nodes, you can instead opt to delete the high-memory pods directly, identifying them with the steps below.
1. Use the following command to list all nodes sorted by memory usage and identify the top memory-consuming nodes:
napp-k top nodes --sort-by=memory
2. Describe the node suspected to be under memory pressure, then list the pods scheduled on it:
napp-k describe node <node-name>
napp-k get pods -A -o wide | grep <node-name>
This will list all pods scheduled on that node with their namespaces.
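Optionally, the memory already requested on the node can be summarized by filtering the describe output for its allocated-resources section (this assumes the standard kubectl describe node output layout):
napp-k describe node <node-name> | grep -A 8 "Allocated resources"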
3. Use the following command to find the top memory-consuming pods across all namespaces:
napp-k top pod -A --sort-by=memory
Cross-reference this output with the list from Step 2 to identify the high-memory pods on the affected node, or use the field-selector shortcut shown below.
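As an alternative to manual cross-referencing, the metrics can be limited to pods on a specific node with a field selector (assuming the kubectl and metrics-server versions in use support --field-selector for top pod):
napp-k top pod -A --sort-by=memory --field-selector spec.nodeName=<node-name>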
4. Delete the pod(s). Kubernetes will automatically recreate each pod, and the scheduler should place it on a node with more available memory.
napp-k delete pod <pod-name> -n <namespace>
Repeat for each pod contributing to the memory pressure.
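Before moving on to the next pod, confirm that the replacement pod is Running and scheduled on a different node (a pod managed by a Deployment comes back under a new name, while a StatefulSet pod keeps its name):
napp-k get pods -n <namespace> -o wide
Check the STATUS and NODE columns for the recreated pod.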
5. Recheck the node memory usage:
napp-k top nodes --sort-by=memory
The alarm on the NSX Manager should automatically go into the "Resolved" state once memory usage drops back below the threshold.