vDefend SSP Alarm: Security Services Platform node CPU usage is high or very high


Article ID: 384126


Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

You are running SSP 5.0 or later and see an alarm with the following description:
"The CPU usage of Security Services Platform node {{ .ResourceID }} is currently {{ .Value }}%, which is above the threshold value."

This alarm indicates that one or more worker nodes in your Security Services Platform (SSP) cluster are experiencing high CPU usage, which may impact platform performance or service availability.

Environment

vDefend SSP >= 5.0

Cause

High CPU usage on worker nodes typically occurs when the current CPU limits and resources are insufficient to handle the workload. 
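
To see the CPU requests and limits currently configured for a busy pod, one option from the SSPI root shell is the following (the pod name is a placeholder):

    k -n nsxi-platform describe pod <pod-name> | grep -A2 -E 'Requests|Limits'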

 

Resolution

  • Restart the Deployment or StatefulSet that has high CPU usage on the affected node:

    Log in to the SSPI root shell. The following commands help identify which pods are consuming the most resources.

    k get nodes
    k top nodes
    k describe node <worker node name>    # lists all the pods running on this node
    k top pods -n nsxi-platform --sort-by=cpu

    k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}'
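
    To also print the owner's name, and to map a ReplicaSet back to its parent Deployment, the following standard kubectl queries can be used (the pod and ReplicaSet names are placeholders):

    k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}{" "}{.metadata.ownerReferences[0].name}{"\n"}'
    k -n nsxi-platform get replicaset <replicaset-name> -o jsonpath='{.metadata.ownerReferences[0].name}{"\n"}'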

    If the output is StatefulSet, run:

    k -n nsxi-platform rollout restart statefulset <owner-name>

    If the output is ReplicaSet, the pod belongs to a Deployment; run:

    k -n nsxi-platform rollout restart deployment <owner-name>

    Here <owner-name> is the name of the StatefulSet or Deployment that owns the high-CPU pod.

    Wait approximately 20 minutes and check whether the alarm auto-resolves. (Run k -n nsxi-platform get pods to verify that the restarted pods are up.)

    Restarting the workload should reschedule its pods onto other nodes and bring down CPU usage on the affected node.
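
    To follow the restart as it progresses, one option is kubectl's rollout status subcommand (use the form that matches the workload type):

    k -n nsxi-platform rollout status deployment <owner-name>
    k -n nsxi-platform rollout status statefulset <owner-name>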

  • If the alarm does not resolve, in the SSPI UI navigate to Lifecycle Management → Instance Management → Edit Deployment Size to increase the number of nodes by 1, then repeat the previous step to restart the Deployment or StatefulSet so that its pods are rescheduled onto the new node.
  • Another option is to scale out the affected service.

On the Security Services Platform UI:

Navigate to System > Platform & Features > Core Services

If the CPU-intensive applications are any of the following, scale out the corresponding category:

  • rawflowcorrelator, overflowcorrelator, druid-middle-manager, druid-broker, latestflow - Analytics
  • kafka-controller, kafka-broker - Messaging
  • minio - Data Storage
  • metrics-manager, metrics-app-server, metrics-query-server - Metrics (refer to KB 384109 for Metrics-specific CPU spike issues)
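
After scaling out or adding a node, you can confirm from the SSPI root shell that the pods are spread across the nodes and that CPU usage on the affected node has dropped, for example:

    k -n nsxi-platform get pods -o wide --sort-by='{.spec.nodeName}'
    k top nodes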

Additional Information