You are running SSP 5.0 or later and see an alarm with the description:
"The memory usage of Security Services Platform node {{ .ResourceID }} is currently {{ .Value }}%, which is above the threshold value."
This alarm indicates that one or more worker nodes in your Security Services Platform (SSP) cluster are experiencing high memory usage, which may degrade platform performance or cause pods to be evicted.
vDefend SSP >= 5.0
High memory usage on worker nodes typically occurs when the current memory limits and resources are insufficient to handle the workload.
Log in to the SSPI root shell. The following commands help identify which pods are consuming the most memory.
List the pods sorted by memory usage:
k top pods -n nsxi-platform --sort-by=memory
List the pods on the node that raised the high memory usage alarm:
k describe node <Resource ID>
Using the output of the commands above, select the pods with high memory usage that run on the node with the high memory usage alarm.
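The selection described above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name pods_on_node_by_memory and the two input files are hypothetical, the first file is assumed to hold the rows printed by kubectl top pods with --no-headers, and the second file is assumed to hold the names of the pods scheduled on the alarmed node, one per line. It calls kubectl directly because shell aliases such as k are normally not expanded inside scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the pods that run on the alarmed node.
#   $1: file with "NAME CPU MEMORY" rows (kubectl top pods ... --no-headers)
#   $2: file with one pod name per line (pods scheduled on the node)
# Row order of the first file is preserved, so the memory sorting is kept.
pods_on_node_by_memory() {
  local top_file="$1" node_pods_file="$2"
  awk '
    FILENAME == ARGV[1] { if (NF) on_node[$1] = 1; next }  # node pod list
    NF && ($1 in on_node) { print $1, $3 }                 # matching top rows
  ' "$node_pods_file" "$top_file"
}

# Example usage (assumes cluster access; <Resource ID> is the node in the alarm):
#   kubectl top pods -n nsxi-platform --sort-by=memory --no-headers > top.txt
#   kubectl -n nsxi-platform get pods --field-selector spec.nodeName=<Resource ID> \
#     -o custom-columns=NAME:.metadata.name --no-headers > node-pods.txt
#   pods_on_node_by_memory top.txt node-pods.txt
```

The helper only filters text, so it can be tried on saved command output before running anything against the cluster.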
For each selected pod, check which kind of resource owns it:
k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}'
(Note: <service-name> is <pod-name> without the hash suffix.)
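As an illustration of the note above, the following sketch derives <service-name> from a pod name. It assumes the standard Kubernetes naming patterns (a Deployment pod is named <service-name>-<replicaset-hash>-<pod-suffix>, a StatefulSet pod is named <service-name>-<ordinal>). The function name service_name_from_pod and the pod names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip the generated suffix from a pod name.
#   $1: pod name
#   $2: owner kind as printed by the jsonpath query (ReplicaSet or StatefulSet)
service_name_from_pod() {
  local pod="$1" kind="$2"
  case "$kind" in
    ReplicaSet)
      pod="${pod%-*}"      # drop the per-pod suffix
      echo "${pod%-*}"     # drop the ReplicaSet hash
      ;;
    StatefulSet)
      echo "${pod%-*}"     # drop the ordinal
      ;;
    *)
      echo "unsupported owner kind: $kind" >&2
      return 1
      ;;
  esac
}

# Example usage with made-up pod names:
#   service_name_from_pod example-svc-7d9f8c6b5-x2k4q ReplicaSet
#   service_name_from_pod example-db-0 StatefulSet
```

If a pod does not follow these naming patterns, read the owner name from the pod's ownerReferences instead of relying on string trimming.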
If the output is StatefulSet, follow the StatefulSet restart steps.
If the output is ReplicaSet, the pod belongs to a Deployment; follow the Deployment restart steps.
If <service-name> is a StatefulSet, run:
k -n nsxi-platform rollout restart statefulset <service-name>
Otherwise, run:
k -n nsxi-platform rollout restart deployment <service-name>
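The choice between the two restart commands can be sketched as follows. This is a minimal sketch, not a supported tool: the function name restart_command is hypothetical, and it only prints the command for review rather than running it. It spells out kubectl, on the assumption that k is an alias for it in the SSPI shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the rollout restart command for a service.
#   $1: owner kind of the pod (StatefulSet or ReplicaSet)
#   $2: service name (pod name without the generated suffix)
restart_command() {
  local kind="$1" service="$2"
  case "$kind" in
    StatefulSet)
      echo "kubectl -n nsxi-platform rollout restart statefulset $service"
      ;;
    ReplicaSet)
      # A ReplicaSet owner means the pod is managed by a Deployment.
      echo "kubectl -n nsxi-platform rollout restart deployment $service"
      ;;
    *)
      echo "unhandled owner kind: $kind" >&2
      return 1
      ;;
  esac
}

# Example usage with a made-up service name:
#   restart_command ReplicaSet example-svc
```

Printing the command instead of executing it leaves a chance to confirm the target before restarting anything.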
Wait about 20 minutes and check whether the alarm auto-resolves. To check that the restarted pods are up, run:
k -n nsxi-platform get pods
The restart should reschedule the pods onto a newer node and thereby reduce memory usage on the affected node.
Once the scale-out is complete, validate that the new nodes are operational:
k get nodes
Expected output:
NAME       STATUS   ROLES    AGE   VERSION
node-1     Ready    <role>   xxm   v1.xx.x
node-2     Ready    <role>   xxm   v1.xx.x
new-node   Ready    <role>   xxm   v1.xx.x   # Ensure the new node is Ready
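The readiness check above can also be scripted. The following is a sketch that assumes the input is the output of kubectl get nodes with --no-headers, where the second column is STATUS. The function name not_ready_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "kubectl get nodes --no-headers" rows on stdin,
# print every node whose STATUS column is not exactly "Ready", and return
# a non-zero status if at least one such node was found.
not_ready_nodes() {
  awk 'NF && $2 != "Ready" { print $1; bad = 1 } END { exit (bad ? 1 : 0) }'
}

# Example usage (assumes cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes && echo "all nodes Ready"
```

A status such as Ready,SchedulingDisabled is reported as not Ready by this sketch, which is the cautious choice for a newly added node.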
On the Security Services Platform UI:
Navigate to System > Platform & Features > Core Services
If the memory-intensive applications are any of the following, scale out the corresponding category:
rawflowcorrelator, overflowcorrelator / druid-middle-manager, druid-broker / latestflow - Analytics
kafka-controller, kafka-broker - Messaging
minio - Data Storage
metrics-manager, metrics-app-server, metrics-query-server - Metrics (Refer to KB 384111 for metrics-specific memory spike issues)
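The application-to-category list above can be expressed as a lookup, which helps when checking many pods. This sketch assumes each group of application names in the list is followed by its category (Analytics, Messaging, Data Storage, Metrics). The function name scale_out_category is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a service name to the category to scale out.
#   $1: service name (pod name without the generated suffix)
scale_out_category() {
  case "$1" in
    rawflowcorrelator|overflowcorrelator|druid-middle-manager|druid-broker|latestflow)
      echo "Analytics" ;;
    kafka-controller|kafka-broker)
      echo "Messaging" ;;
    minio)
      echo "Data Storage" ;;
    metrics-manager|metrics-app-server|metrics-query-server)
      echo "Metrics" ;;
    *)
      echo "no matching category for: $1" >&2
      return 1 ;;
  esac
}

# Example usage:
#   scale_out_category kafka-broker
```

A service that is not in the list returns a non-zero status, so scaling out is not suggested for applications the list does not cover.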
For further troubleshooting if a node is degraded or down, see:
https://techdocs.broadcom.com/us/en/vmware-security-load-balancing/vdefend/security-services-platform/5-0/security-services-platform-installer/troubleshooting-sspi/troubleshooting-workload-cluster.html