NSX Application Platform Node Disk Usage High/Very High Alarm

Article ID: 378730


Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

NSX Application Platform Node Disk Usage High Alarm or NSX Application Platform Node Disk Usage Very High Alarm is in the Open state with the following description:

  • Feature Name: NSX Application Platform Health
  • Event Type:  Node Disk Usage High or Very High
  • Description: The disk usage of NSX Application Platform node {napp_node_name} is above the high threshold value of {system_usage_threshold}%.

Environment

All NSX Application Platform (NAPP) versions

Resolution

  1. Identify the Type of Node:

    • First, check whether the node under pressure is a Worker Node or a Control Plane Node. You can find this information in the Alarm Description by expanding the details of the generated alarm. Note down the node name.
    • Navigate to System → NSX Application Platform → Resources.
    • Hover over each node name to display its full name and locate the one matching the node name noted from the Alarm Description.
    • To differentiate the node type:
      • A Control Plane Node is explicitly labeled as such below the node name.
      • A Worker Node has no type indicated below the node name.
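    • Alternatively, you can check node roles from the NSX Manager root shell (a minimal sketch, assuming the napp-k kubectl wrapper has cluster-scoped read access):
      # Control plane nodes show "control-plane" (or "master" on older
      # Kubernetes releases) in the ROLES column; worker nodes show "<none>".
      napp-k get nodes -o wide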

  2. For Worker Node Alarm:

    • Step 1: Add Nodes to the Kubernetes Cluster:
      • If disk usage is high, add one or more nodes to the existing Kubernetes cluster (contact your Kubernetes provider for assistance).

    • Step 2: Free Up Disk Space by Deleting Pods:
      • On the NSX Manager root shell, run the following command to locate the relevant pods:
        napp-k get pods -o wide | grep <napp_node_name>
        
      • Replace <napp_node_name> with the node name found in the open alarm.
      • Delete some of the pods from the list to allow Kubernetes to reschedule them on the newly added node:
        napp-k delete pod <pod-name>
        
      • The following pods are recommended for deletion if present in the list:
        • cluster-api
        • monitor
      • Steps to get pod names:
        • For cluster-api pod:
          napp-k get pod -l=app.kubernetes.io/name=cluster-api
          
        • For monitor pod:
          napp-k get pod -l=app.kubernetes.io/name=monitor,cluster-api-client=true
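      • Optionally, the lookup and deletion can be combined into a single command by pairing the label selector with a field selector, so that only pods scheduled on the affected node are deleted (a sketch, assuming the napp-k wrapper follows standard kubectl semantics):
        # Delete only the cluster-api pods running on the affected node.
        napp-k delete pod -l app.kubernetes.io/name=cluster-api --field-selector spec.nodeName=<napp_node_name>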
          
    • Step 3: Check the disk space:
      • After the deleted pods are rescheduled, the kubelet's garbage collection should eventually reclaim disk space by removing unused container images, which are often the main cause of high disk usage.
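      • To confirm that the node has recovered, you can inspect its DiskPressure condition, which should report False once sufficient space has been reclaimed:
        napp-k describe node <napp_node_name> | grep DiskPressure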

    If the above steps do not resolve the issue, please contact your Kubernetes service provider or Broadcom support for further assistance in clearing disk space from the Kubernetes nodes.

  3. For Control Plane Node Alarm:

    • Step 1: Identify the Node Under Pressure:
      • In the Resources tab of the NSX Application Platform UI, check the "Storage" field for each control plane node. Hover over each node name to reveal the full name and identify the control plane node that is under pressure.

    • Step 2: Access the Control Plane of the Cluster Node:
      • Log in to the Supervisor Cluster:
        • SSH into the vCenter Server Appliance, log in as root, and switch to shell mode:
          shell
          
        • Retrieve the Supervisor Control Plane (SCP) IP address and credentials:
          /usr/lib/vmware-wcp/decryptK8Pwd.py
          
        • Example output:
          Cluster: domain-c46:def22104-2b40-4048-b049-271b1de46b94  
          IP: 10.99.2.10  
          PWD: 3lnCN5ccPhg0cl1WQTZTGNzL[...]  
          
        • SSH into the Supervisor Cluster using the retrieved IP and password:
          ssh root@10.99.2.10
          
    • Step 3: Log in to the Guest Cluster from the Supervisor Cluster:
      • List available Guest Clusters:
        kubectl get tkc -A

      • Retrieve the SSH password secret for the Guest Cluster:
        kubectl get secrets <guest-cluster-name>-ssh-password -n <namespace> -o yaml

      • Example output:
        apiVersion: v1
        data:
          ssh-passwordkey: S2J1OWNCZ01XbXNzMm1JaW1GMmJxMTZnNHV0YjFWYUdYS2FkQjVVcmpUYz0=

      • Decode the password:
        echo <copied-ssh-passwordkey> | base64 -d

      • Save the decoded password for use.
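      • Alternatively, the retrieval and decoding can be combined into a single command (a sketch, assuming the secret layout shown in the example output above):
        kubectl get secret <guest-cluster-name>-ssh-password -n <namespace> -o jsonpath='{.data.ssh-passwordkey}' | base64 -d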

    • Step 4: SSH into the Guest Cluster Control Plane Node:
      • List the machines in the Supervisor Cluster to identify the control plane node's IP (an optional filter is sketched at the end of this step):
        kubectl get vm -A -o wide

      • SSH into the Guest Cluster control plane node:
        ssh vmware-system-user@<control-plane-node-ip>

      • Enter the password obtained earlier to access the Guest Cluster control plane.
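      • In clusters with many virtual machines the list can be long; control plane machines in a Tanzu Kubernetes cluster typically include "control-plane" in their names, so a simple filter narrows the output (a sketch; the naming convention is an assumption and may vary in your environment):
        kubectl get vm -A -o wide | grep control-plane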

    • Step 5: Check Disk Usage on the Control Plane Node:
      • Navigate to the /var/log directory and run the following command to identify the largest consumers, such as journal logs (prefix the commands with sudo if your user lacks the required permissions):
        du -h --max-depth=1
        
      • If journal logs are consuming space, reduce their size:
        journalctl --vacuum-size=500M
        
      • You can also limit the retention period of journal logs to the last 2 days:
        journalctl --vacuum-time=2d
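      • Before and after vacuuming, you can check how much space the archived journal files occupy and confirm the node's overall filesystem usage:
        journalctl --disk-usage
        df -h /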
        
    • Step 6: Verify Disk Usage:
      • After performing these steps, wait for 5-10 minutes and check the disk usage status again in the NSX Application Platform UI under the Resources tab. The alarm should also auto-resolve.
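      • To verify from the command line instead of the UI, you can query the NSX Manager alarms API for any remaining open alarms (a sketch; substitute your NSX Manager address and credentials, and note that the status filter is an assumption based on the standard NSX alarms API):
        curl -k -u admin 'https://<nsx-manager>/api/v1/alarms?status=OPEN'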