vDefend SSP Alarm: Security Services Platform node memory usage is high or very high

Article ID: 384128

Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

You are running SSP 5.0 or later and are seeing an alarm with the description:
"The memory usage of Security Services Platform node {{ .ResourceID }} is currently {{ .Value }}%, which is above the threshold value."

This alarm indicates that one or more worker nodes in your Security Services Platform (SSP) cluster are experiencing high memory usage, which may degrade platform performance or cause pods to be evicted.

Environment

vDefend SSP >= 5.0

Cause

High memory usage on worker nodes typically occurs when the configured memory limits and node resources are insufficient for the current workload.

Resolution

  • Restart the Deployment/StatefulSet with high memory usage on the faulted node:

    Log in to the SSPI root shell. The following commands help identify which pods are consuming the most resources:

    k top pods -n nsxi-platform --sort-by=memory   -> lists pods sorted by memory usage

    k describe node <Resource ID>   -> lists the pods running on the node that raised the alarm

    Using the output of the above commands, select the pods with high memory usage that are running on the node that raised the alarm, then check the owner kind of each pod:

    k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}'   (Note: <service-name> is <pod-name> without the generated hash suffixes.)

    If the output is StatefulSet, follow the StatefulSet restart step.
    If the output is ReplicaSet, the pod belongs to a Deployment.

    If <service-name> is a StatefulSet, run:

    k -n nsxi-platform rollout restart statefulset <service-name>

    Otherwise, run:

    k -n nsxi-platform rollout restart deployment <service-name>

    Wait for ~20 minutes and check whether the alarm auto-resolves. (Run k -n nsxi-platform get pods to verify that the restarted pods are up.)

    This should reschedule the pods to another node and thereby reduce memory usage on the faulted node.

  • If this does not resolve the alarm, on the SSPI UI navigate to Lifecycle Management → Instance Management → Edit Deployment Size and increase the number of nodes by 1.

    Once the scale-out completes, validate that the new node is operational:

    k get nodes

    Expected output : 
    NAME           STATUS   ROLES    AGE     VERSION
    node-1         Ready    <role>   xxm     v1.xx.x
    node-2         Ready    <role>   xxm     v1.xx.x
    new-node       Ready    <role>   xxm     v1.xx.x  # Ensure the new node is Ready

  • Repeat the first step to restart the Deployment/StatefulSet so that the pods are rescheduled to the new node.
  • Another option is to scale out the affected service category.

    On the Security Services Platform UI:

    Navigate to System > Platform & Features > Core Services

    If the memory-intensive applications are any of the following, scale out the corresponding category:

    rawflowcorrelator, overflowcorrelator, druid-middle-manager, druid-broker, latestflow - Analytics
    kafka-controller, kafka-broker - Messaging
    minio - Data Storage
    metrics-manager, metrics-app-server, metrics-query-server - Metrics (refer to KB 384111 for Metrics-specific memory spike issues)
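The decision logic in the restart step above can be sketched as a small shell helper. This is an illustrative sketch only, not part of the product: the workload_name and restart_cmd function names are hypothetical, and it assumes the standard Kubernetes naming convention in which Deployment pods carry a ReplicaSet hash plus a pod hash, while StatefulSet pods carry a numeric ordinal suffix.

```shell
#!/bin/sh
# Hypothetical helper: derive the workload name from a pod name so it can be
# passed to "k -n nsxi-platform rollout restart".
# Assumption: Deployment pods end in "-<replicaset-hash>-<pod-hash>" (two
# generated suffixes); StatefulSet pods end in "-<ordinal>" (one suffix).

workload_name() {
  pod="$1"; kind="$2"
  case "$kind" in
    StatefulSet)
      # Strip the trailing ordinal, e.g. kafka-controller-0 -> kafka-controller
      echo "${pod%-*}"
      ;;
    ReplicaSet|Deployment)
      # Strip the pod hash and the ReplicaSet hash, e.g.
      # druid-broker-5f7d9c8b6-x2v4p -> druid-broker
      tmp="${pod%-*}"
      echo "${tmp%-*}"
      ;;
  esac
}

# Print the restart command matching the owner kind reported by the
# ownerReferences jsonpath query from the step above.
restart_cmd() {
  pod="$1"; kind="$2"
  name="$(workload_name "$pod" "$kind")"
  if [ "$kind" = "StatefulSet" ]; then
    echo "k -n nsxi-platform rollout restart statefulset $name"
  else
    echo "k -n nsxi-platform rollout restart deployment $name"
  fi
}

restart_cmd druid-broker-5f7d9c8b6-x2v4p ReplicaSet
# -> k -n nsxi-platform rollout restart deployment druid-broker
restart_cmd kafka-controller-0 StatefulSet
# -> k -n nsxi-platform rollout restart statefulset kafka-controller
```

The pod names shown are examples; substitute the actual pod names reported by k top pods on your cluster.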

Additional Information