vDefend SSP Alarm: Security Services Platform node memory usage is high or very high

Article ID: 384128

Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

You are running SSP 5.0 or later and are seeing an alarm with the description:
"The memory usage of Security Services Platform node {{ .ResourceID }} is currently {{ .Value }}%, which is above the threshold value."

This alarm indicates that one or more worker nodes in your Security Services Platform (SSP) cluster are experiencing high memory usage, which may degrade platform performance or cause pods to be evicted.

Environment

vDefend SSP >= 5.0

Cause

High memory usage on worker nodes typically occurs when the configured memory limits and node resources are insufficient for the current workload.

Resolution

  • Restart the Deployment/StatefulSet with high memory usage on the faulted node:

    Log in to the SSPI root shell. The following commands help identify which pods are consuming the most resources:

    k top pods -n nsxi-platform --sort-by=memory   -> lists pods sorted by memory usage

    k describe node <Resource ID>   -> lists the pods running on the node that raised the alarm

    Using the output of the above commands, select the pods with high memory usage that are running on the node that raised the alarm, then check the owner kind of each pod:

    k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}'   (Note: <service-name> is <pod-name> without the generated hash suffixes.)

    If the output is StatefulSet, follow the StatefulSet restart step.
    If the output is ReplicaSet, the pod belongs to a Deployment.

    If <service-name> is a StatefulSet, run:

    k -n nsxi-platform rollout restart statefulset <service-name>

    Otherwise, run:

    k -n nsxi-platform rollout restart deployment <service-name>

    Wait for ~20 minutes and check whether the alarm auto-resolves. (Run k -n nsxi-platform get pods to verify that the restarted pods are up.)

    This should reschedule the pods to another node and thereby reduce memory usage on the faulted node.

  • If this does not resolve the alarm, on the SSPI UI navigate to Lifecycle Management → Instance Management → Edit Deployment Size and increase the number of nodes by 1.

    Once the scale-out completes, validate that the new node is operational:

    k get nodes

    Expected output : 
    NAME           STATUS   ROLES    AGE     VERSION
    node-1         Ready    <role>   xxm     v1.xx.x
    node-2         Ready    <role>   xxm     v1.xx.x
    new-node       Ready    <role>   xxm     v1.xx.x  # Ensure the new node is Ready

  • Repeat the first step to restart the Deployment/StatefulSet so that the pods are rescheduled to the new node.
  • Another option is to scale out the affected service category.

    On the Security Services Platform UI:

    Navigate to System > Platform & Features > Core Services

    If the memory-intensive applications are any of the following, scale out the corresponding category:

    rawflowcorrelator, overflowcorrelator, druid-middle-manager, druid-broker, latestflow - Analytics
    kafka-controller, kafka-broker - Messaging
    minio - Data Storage
    metrics-manager, metrics-app-server, metrics-query-server - Metrics (refer to KB 384111 for Metrics-specific memory spike issues)
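The decision logic in the restart step above can be sketched as a small shell helper. This is an illustrative sketch only, not part of the product: the workload_name and restart_cmd function names are hypothetical, and it assumes the standard Kubernetes naming convention in which Deployment pods carry a ReplicaSet hash plus a pod hash, while StatefulSet pods carry a numeric ordinal suffix.

```shell
#!/bin/sh
# Hypothetical helper: derive the workload name from a pod name so it can be
# passed to "k -n nsxi-platform rollout restart".
# Assumption: Deployment pods end in "-<replicaset-hash>-<pod-hash>" (two
# generated suffixes); StatefulSet pods end in "-<ordinal>" (one suffix).

workload_name() {
  pod="$1"; kind="$2"
  case "$kind" in
    StatefulSet)
      # Strip the trailing ordinal, e.g. kafka-controller-0 -> kafka-controller
      echo "${pod%-*}"
      ;;
    ReplicaSet|Deployment)
      # Strip the pod hash and the ReplicaSet hash, e.g.
      # druid-broker-5f7d9c8b6-x2v4p -> druid-broker
      tmp="${pod%-*}"
      echo "${tmp%-*}"
      ;;
  esac
}

# Print the restart command matching the owner kind reported by the
# ownerReferences jsonpath query from the step above.
restart_cmd() {
  pod="$1"; kind="$2"
  name="$(workload_name "$pod" "$kind")"
  if [ "$kind" = "StatefulSet" ]; then
    echo "k -n nsxi-platform rollout restart statefulset $name"
  else
    echo "k -n nsxi-platform rollout restart deployment $name"
  fi
}

restart_cmd druid-broker-5f7d9c8b6-x2v4p ReplicaSet
# -> k -n nsxi-platform rollout restart deployment druid-broker
restart_cmd kafka-controller-0 StatefulSet
# -> k -n nsxi-platform rollout restart statefulset kafka-controller
```

The pod names shown are examples; substitute the actual pod names reported by k top pods on your cluster.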

Additional Information