NSX alarm "Edge Memory Usage Very High" results in Edge failover and datapath disruption

Article ID: 412629

Products

VMware NSX

Issue/Introduction

NSX Manager may raise the alarm "Edge Memory Usage Very High" at seemingly random times, with details similar to "The memory usage on Edge node <UUID> has reached 93% which is at or above the very high threshold value of 90%".

This alarm may be accompanied by a second alarm, "Management Channel on <NSX-Manager-Node> to Transport Node <NSX-Edge-Node> (NSX-Edge-IP) is down for 5 minutes". In an active/standby HA configuration, the affected NSX Edge may also have failed over its active role to the standby Edge node.

NSX IDPS has rules that inspect SMB traffic, either explicitly or implicitly.

During the alarm period, there is a high throughput of SMB traffic.

Environment

VMware NSX 4.2.x

VMware VCF 9.0

Cause

A high volume of SMB traffic under inspection causes IDPS to consume a large amount of memory.

Resolution

To resolve this issue, open a Broadcom Support Request and upload the following required data and logs:

  • Logs from all three NSX Manager nodes
  • NSX Edge logs for the affected Edge cluster
  • NSX Edge VM memory heap maps:
    1. SSH into the affected Edge node as the root user (if both Edge nodes experience the same issue, pick the active one)
    2. Start Heap Profiling:
      This command initiates the heap profiling, using /var/log/dp_heap as the base name for the output files.

      # edge-appctl -t /var/run/vmware/edge/dpd.ctl heap_profile/start /var/log/dp_heap

    3. Take Initial Snapshot (Mark Point 1):
      This command captures the current heap state at the beginning of our observation period.

      # edge-appctl -t /var/run/vmware/edge/dpd.ctl heap_profile/dump test0

    4. Allow Collection to Run (Wait Period):
      Allow the heap profiling to run in the background for several hours, ideally overnight. During this time, no further commands related to heap profiling should be executed, and the dpd process should remain running.

      Important: If the dpd process restarts or an Out-Of-Memory (OOM) kill occurs during this collection period, you will need to restart the entire collection process from Step 2.

    5. Take Final Snapshot (Mark Point 2):
      After the desired collection period, take a second snapshot of the heap to capture its state at the end of the observation.

      # edge-appctl -t /var/run/vmware/edge/dpd.ctl heap_profile/dump test1

    6. Stop Heap Profiling:
      This command gracefully stops the heap profiling process.

      # edge-appctl -t /var/run/vmware/edge/dpd.ctl heap_profile/stop

    7. Verify Profiling is Stopped:
      Confirm that the profiling has successfully stopped by checking its state. The output should show "state": "stopped".

      # edge-appctl -t /var/run/vmware/edge/dpd.ctl heap_profile/state

      Expected output:

      { "state": "stopped"}

    8. Locate the Data Files:

      After completing these steps, you should find two heap profile files in the /var/log/ directory:

      /var/log/dp_heap.0001.heap
      /var/log/dp_heap.0002.heap

    9. Upload both heap files to the Broadcom Support Request along with the other logs listed above.
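
The heap profiling steps above (2 through 8) can be sketched as a small shell script. This is only an illustrative wrapper, not a Broadcom-provided tool: the edge-appctl commands and file names are copied from the steps above, while the function names, the DRY_RUN switch, and the 8-hour WAIT_SECONDS default are assumptions made for this example. The sketch does not detect a dpd restart or OOM kill during the wait, so that condition still has to be checked manually.

```shell
#!/bin/sh
# Illustrative sketch of steps 2-8. Only the edge-appctl invocations come from
# the article; all function and variable names below are hypothetical.

CTL=/var/run/vmware/edge/dpd.ctl
BASE=${BASE:-/var/log/dp_heap}
WAIT_SECONDS=${WAIT_SECONDS:-28800}   # assumed 8 hours ("several hours, ideally overnight")
DRY_RUN=${DRY_RUN:-1}                 # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# heap_profile <action> [argument], e.g. "heap_profile dump test0".
heap_profile() {
    run edge-appctl -t "$CTL" "heap_profile/$1" ${2:+"$2"}
}

# Succeeds if the given state output contains "state": "stopped".
is_stopped() {
    printf '%s\n' "$1" | grep -Eq '"state"[[:space:]]*:[[:space:]]*"stopped"'
}

# Succeeds if both expected heap files exist and are non-empty.
check_heap_files() {
    for f in "$1.0001.heap" "$1.0002.heap"; do
        [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
    done
}

collect() {
    heap_profile start "$BASE"     # step 2: start profiling
    heap_profile dump test0        # step 3: initial snapshot
    run sleep "$WAIT_SECONDS"      # step 4: wait period
    heap_profile dump test1        # step 5: final snapshot
    heap_profile stop              # step 6: stop profiling
    state=$(heap_profile state)    # step 7: query the state
    if [ "$DRY_RUN" = "1" ]; then
        echo "$state"
        return 0
    fi
    is_stopped "$state" || { echo "profiling not stopped: $state" >&2; return 1; }
    check_heap_files "$BASE"       # step 8: confirm both files exist
}
```

With DRY_RUN left at its default of 1, calling collect only prints the commands in order, which is a safe way to review them first. Running it with DRY_RUN=0 as root on the Edge node would execute the same sequence and then check for the two heap files.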