vDefend SSP Alarm: Security Services Platform node disk usage is high or very high

Article ID: 384127

Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

You are running SSP 5.0 or later and see an alarm with the description:
"The disk storage usage of Security Services Platform node {{ .ResourceID }} is currently {{ .Value }}%, which is above the threshold value."

This alarm indicates that one or more worker nodes in your Security Services Platform (SSP) cluster are experiencing high disk storage usage, potentially impacting application performance, logging, and overall node health.
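
To confirm which nodes are affected before taking action, the cluster can be inspected from the SSPI root shell. A minimal sketch, where <node-name> stands for the {{ .ResourceID }} value from the alarm:

    # List all cluster nodes and their status
    k get nodes

    # Check whether the kubelet is reporting disk pressure on the affected node
    k describe node <node-name> | grep DiskPressure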

Environment

vDefend SSP >= 5.0

Cause

High disk usage on worker nodes can occur for several reasons:

  1. Insufficient disk space: The allocated disk space may be inadequate for the current workload or data growth.
  2. Disk-intensive operations: workloads such as logging, analytics, or data processing can temporarily consume significant disk space.
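
To see which of these applies, disk usage on the node itself can be inspected. A minimal sketch, assuming root shell access to the affected node; mount points and growth directories vary by deployment:

    # Show filesystem usage on the node
    df -h

    # Find the largest directories under /var, a common growth point (assumption)
    du -sh /var/* 2>/dev/null | sort -rh | head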

Resolution

  • On the node identified in the alarm, from the SSPI root shell, run 'k describe node {{ .ResourceID }}' and pick any pod from the nsxi-platform namespace listed under that node. Then identify and restart the pod's owning workload (Deployment or StatefulSet):

    k -n nsxi-platform get pod <pod-name> -o jsonpath='{.metadata.ownerReferences[0].kind}'

    (Note: <service-name> is <pod-name> without the trailing hash suffix.)

    If the output is StatefulSet, run:

    k -n nsxi-platform rollout restart statefulset <service-name>

    If the output is ReplicaSet, the pod belongs to a Deployment; run:

    k -n nsxi-platform rollout restart deployment <service-name>

    Wait approximately 10 minutes, then confirm the restarted pods are up and running:

    k -n nsxi-platform get pods

    Note: While a pod is terminating, any services it provides are temporarily unavailable until a replacement pod is scheduled and becomes ready.
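
    As a worked sketch of the full sequence, using a hypothetical pod named druid-broker-5f7c9d8b4-xk2lp that turns out to be owned by a Deployment:

    # Determine the owning workload kind
    k -n nsxi-platform get pod druid-broker-5f7c9d8b4-xk2lp -o jsonpath='{.metadata.ownerReferences[0].kind}'

    # Output is ReplicaSet, so the pod belongs to the Deployment druid-broker
    # (<service-name> = pod name without the hash suffix)
    k -n nsxi-platform rollout restart deployment druid-broker

    # Optionally follow the rollout until the replacement pod is ready
    k -n nsxi-platform rollout status deployment druid-broker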

  • If disk usage is still high, repeat the above step for additional pods as needed.
  • If the above steps do not help, in the SSPI UI navigate to Lifecycle Management → Instance Management. Select Edit Deployment Size and increase the worker node count by 1.

    Once the scale-out completes, validate that the new node is operational:

    k get nodes

    Expected output:
    NAME           STATUS   ROLES    AGE     VERSION
    node-1         Ready    <role>   xxm     v1.xx.x
    node-2         Ready    <role>   xxm     v1.xx.x
    new-node       Ready    <role>   xxm     v1.xx.x  # Ensure the new node is Ready
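
    If the new node does not reach Ready within a few minutes, its recent events may indicate why. A sketch, where new-node is a hypothetical node name:

    # Show recent events for the new node, newest last
    k get events --field-selector involvedObject.kind=Node,involvedObject.name=new-node --sort-by=.lastTimestamp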

  • Verify pod distribution: check that pods are spread across nodes using 'k -n nsxi-platform get pods -o wide', and repeat the first two resolution steps so that restarted services are rescheduled onto the new node (see the sketch below).
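
    To list only the pods scheduled on the new node, a field selector can be used. A sketch, where new-node is a hypothetical node name:

    # Show nsxi-platform pods running on the new node
    k -n nsxi-platform get pods -o wide --field-selector spec.nodeName=new-node
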
  • If the issue persists, open a support ticket with Broadcom.