"Memory Usage Warning" event

Products

VMware SD-WAN

Issue/Introduction

A "Memory Usage Warning" event is generated by the VMware SD-WAN Edge when the Resource Monitor process detects that Edge memory utilization has exceeded defined thresholds. If utilization remains above the critical threshold, the Edge Service will restart.

Once the event is created, it is communicated to VMware SD-WAN Orchestrator as an alert.

Environment

VMware SD-WAN

Resolution

How the VMware SD-WAN Edge Determines a Memory Utilization Issue

The Edge uses a Resource Monitor process that periodically calculates the Edge's available memory. The calculation takes the total Edge RAM and subtracts any reserved memory (e.g. security VNF allocation if applicable, etc.) and then updates this difference as the available Edge memory. The remaining memory is then shared between several SD-WAN supporting processes.
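The calculation described above amounts to a simple subtraction. The following is a minimal sketch only: the function name, the parameter names, and the figures are hypothetical and are not taken from the Edge software or any Edge data sheet.

```python
def available_memory_mb(total_ram_mb, reserved_mb):
    """Return the memory left for SD-WAN processes after reservations.

    reserved_mb maps a label (for example a security VNF) to its
    reserved size in MB. All names here are illustrative.
    """
    reserved_total = sum(reserved_mb.values())
    if reserved_total > total_ram_mb:
        raise ValueError("reserved memory exceeds total RAM")
    return total_ram_mb - reserved_total


# Illustrative numbers only: an 8 GB Edge with 2 GB set aside for a security VNF.
available = available_memory_mb(8192, {"security_vnf": 2048})
print(available)  # 6144
```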

In order to avoid an “out of memory” condition for the most critical Edge process, edged, certain thresholds are defined to alert of potential danger. By default the Resource Monitor checks memory consumption every 5 seconds. As long as the current value is within the specified memory thresholds, the Resource Monitor will only log an internal sample every 15 minutes.

What an Edge Does When a Memory Utilization Issue is Detected

If memory utilization is between 40% and 59% of the available memory, a warning event, EDGE_MEMORY_USAGE_WARNING, is sent to the Orchestrator. This event is sent every 60 minutes until the memory usage drops under the 40% threshold.

Event: Memory Usage Warning
Severity: Warning
Event Detail: Process edged memory usage (## MB) exceeds ##% threshold

If the memory utilization reaches 60%, the Resource Monitor waits for 90 seconds to allow the edged process to recover from a possible temporary spike in memory usage. If memory usage persists at a 60% or higher level for more than 90 seconds, the Edge will generate the error message EDGE_MEMORY_USAGE_ERROR and send this to the Orchestrator. The Orchestrator will translate this into the Memory Usage Critical event and post this to the Orchestrator's Events page. When a Memory Usage Critical event is detected, the Edge's Resource Monitor restarts the edged process to clear the Edge's memory (this is also referred to as an Edge Service restart). Restarting the edged process results in a 15-30 second disruption of customer traffic.

Event: Memory Usage Critical
Severity: Error
Event Detail: Process edged memory usage (## MB) exceeds 60% threshold
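The warning and critical behavior described above can be modeled as a small state machine. This is a hedged sketch, not the Edge's actual Resource Monitor code: the class and method names are hypothetical, the hourly pacing of warning events is not modeled, and treating the 90-second grace period as a warning state is an assumption.

```python
WARNING_PCT = 40.0    # warning threshold from the text above
CRITICAL_PCT = 60.0   # restart threshold from the text above
GRACE_SECONDS = 90    # how long usage must persist at the critical level
SAMPLE_INTERVAL = 5   # default sampling interval in seconds


class MemoryMonitor:
    """Hypothetical model of the threshold logic; not the Edge's real code."""

    def __init__(self):
        self.critical_since = None  # time at which usage first reached 60%

    def observe(self, now, used_pct):
        """Classify one sample taken at time `now` (in seconds)."""
        if used_pct >= CRITICAL_PCT:
            if self.critical_since is None:
                self.critical_since = now
            if now - self.critical_since > GRACE_SECONDS:
                # Usage persisted for more than 90 seconds: error and restart.
                self.critical_since = None
                return "EDGE_MEMORY_USAGE_ERROR"
            # Assumption: still within the grace period, so only a warning applies.
            return "EDGE_MEMORY_USAGE_WARNING"
        self.critical_since = None  # a drop below 60% resets the timer
        if used_pct >= WARNING_PCT:
            return "EDGE_MEMORY_USAGE_WARNING"
        return "OK"


monitor = MemoryMonitor()
states = [monitor.observe(t, 65.0) for t in range(0, 100, SAMPLE_INTERVAL)]
print(states[-1])  # EDGE_MEMORY_USAGE_ERROR
```

In this model a temporary spike that drops back under 60% before the grace period ends resets the timer, so no error is raised.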

Note: The Edge service memory usage thresholds changed in Q1 of 2021. In earlier Edge releases, the threshold that triggered a warning alert was 50% and the threshold that triggered an Edge service restart was 70%. After the change, a warning alert is triggered at 40%, and an Edge service restart is triggered at 60% once that level has been sustained for more than 90 seconds.

The change to the 40% Alert / 60% Restart memory thresholds takes effect beginning with the following release in each Edge software release train:

3.4.6
4.0.3
4.1.2
4.2.1
4.2.2
4.3.0
4.5.0
5.0.0.0

An Edge using software that is earlier than the listed release in its release train (for example, 3.4.5, 4.0.2, or 4.2.0) would use the older 50% / 70% thresholds.
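The release list can be turned into a lookup that reports which thresholds a given Edge version would use. This is a sketch under stated assumptions: the helper names are hypothetical, a release train is assumed to be identified by its first two version components, the earliest listed release is used where a train appears more than once, and trains that are not in the list return None rather than a guess.

```python
# Releases copied from the list above.
LISTED_RELEASES = ["3.4.6", "4.0.3", "4.1.2", "4.2.1", "4.2.2", "4.3.0", "4.5.0", "5.0.0.0"]


def parse_version(text):
    """Convert '4.0.3' to (4, 0, 3, 0); pads to four parts for fair comparison."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (4 - len(parts))
    return tuple(parts)


# Assumption: a train is the (major, minor) pair; keep the earliest listed release.
FIRST_NEW_THRESHOLD_RELEASE = {}
for release in LISTED_RELEASES:
    version = parse_version(release)
    train = version[:2]
    current = FIRST_NEW_THRESHOLD_RELEASE.get(train, version)
    FIRST_NEW_THRESHOLD_RELEASE[train] = min(current, version)


def thresholds_for(version_text):
    """Return (warning_pct, restart_pct), or None for a train not in the list."""
    version = parse_version(version_text)
    first = FIRST_NEW_THRESHOLD_RELEASE.get(version[:2])
    if first is None:
        return None
    return (40, 60) if version >= first else (50, 70)


print(thresholds_for("3.4.5"))  # (50, 70)
print(thresholds_for("4.0.3"))  # (40, 60)
```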

Potential Cause: Flow Count Exceeds an Edge Model's Capacity

One of the most common reasons for high memory usage under normal conditions is an excessive number of flows for the affected Edge model. The flow count can be verified from the Monitor > Edge > System tab. The Edge flow capacity may be confirmed by consulting the latest Edge data sheet.

If the flow count does indeed exceed the Edge model specifications there are two things to consider:

Using the Orchestrator Monitoring tools, verify if the traffic generated on this Edge is legitimate for the customer network. If unexpected sources, destinations, or applications are observed, please address this traffic as this may lower the flow count to expected levels and resolve the memory utilization issues for this Edge.
If traffic analysis indicates nothing wrong with the traffic but the flow count does exceed the capacity for that particular Edge model, consider replacing the existing Edge with a model possessing a greater flow capacity.

Potential Cause: Edge Memory Leak

If the flow count is well within recommended limits for this Edge model per the data sheet, there is the rarer possibility the Edge is suffering a memory leak. The risk of an Edge memory leak increases with the age of the Edge Software release the Edge is using.

One sign of a potential memory leak is that the "Memory Usage Warning" events delivered to the Orchestrator every hour show slowly increasing memory usage from one event to the next, and that this incremental increase is observed over days or even weeks. It is important to ensure that the affected Edge is using the latest VMware SD-WAN Edge software release, as the latest build includes the fixes for all Edge memory leaks resolved as of that release date.
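The trend described above can be checked from a series of event details. This is a rough heuristic sketch, not a VMware tool: the function names are hypothetical, the regular expression assumes the Event Detail format shown earlier in this article, and the default of 24 events (about one day of hourly warnings) is an arbitrary assumption.

```python
import re

# Matches the Event Detail format shown earlier in this article.
EVENT_DETAIL = re.compile(
    r"Process edged memory usage \((\d+) MB\) exceeds (\d+)% threshold"
)


def usage_mb(detail):
    """Extract the reported edged memory usage in MB, or None if no match."""
    match = EVENT_DETAIL.search(detail)
    return int(match.group(1)) if match else None


def looks_like_leak(details, min_events=24):
    """True when usage never decreases across events and ends higher than it began."""
    samples = [mb for mb in map(usage_mb, details) if mb is not None]
    if len(samples) < min_events:
        return False  # too little history to judge a slow trend
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and samples[-1] > samples[0]


# Synthetic example: usage grows by 5 MB per hourly event over two days.
events = [
    f"Process edged memory usage ({1000 + 5 * i} MB) exceeds 40% threshold"
    for i in range(48)
]
print(looks_like_leak(events))  # True
```

A real leak may show occasional dips, so a strict never-decreasing test can miss it; treat a negative result as inconclusive.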

Workaround for a Memory Leak:

If the Edge is indeed suffering a memory leak and an upgrade either still needs to be scheduled or cannot be scheduled for some time, the workaround to prevent an unexpected traffic disruption is to schedule a maintenance window in which an administrator restarts the Edge service to clear the memory. An Edge Service restart may be performed on the Orchestrator by going to Remote Actions > Restart Service. Customer traffic is disrupted for 15-30 seconds when executing this action. Schedule restarts based on how quickly memory usage is increasing over time, the idea being to restart before usage reaches the threshold that triggers an automatic restart (60% on releases using the newer thresholds, 70% on earlier releases).
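To decide how soon a maintenance window is needed, the growth rate can be extrapolated linearly. This is a minimal sketch with hypothetical names: it assumes evenly spaced samples of edged memory usage expressed as a percentage of available memory, and it assumes growth continues at the average observed rate, which a real leak may not do. The threshold_pct parameter defaults to the 60% restart threshold described earlier and can be adjusted.

```python
def hours_until_threshold(samples_pct, threshold_pct=60.0, hours_between_samples=1.0):
    """Estimate hours until usage reaches threshold_pct; None if usage is not rising.

    Uses the average rate between the first and last sample (linear extrapolation).
    """
    if len(samples_pct) < 2:
        return None
    elapsed_hours = (len(samples_pct) - 1) * hours_between_samples
    rate_per_hour = (samples_pct[-1] - samples_pct[0]) / elapsed_hours
    if rate_per_hour <= 0:
        return None
    remaining_pct = max(threshold_pct - samples_pct[-1], 0.0)
    return remaining_pct / rate_per_hour


# Synthetic example: usage rises 0.5 percentage points per hour and is now at 44%.
print(hours_until_threshold([42.0, 42.5, 43.0, 43.5, 44.0]))  # 32.0
```

Scheduling the restart comfortably inside the estimated time leaves margin for the growth rate increasing.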

If the Edge is running the latest software and there are still memory warnings with flow counts well within Edge specifications, please engage the VMware SD-WAN Support Team as outlined here: Contact VMware by Broadcom SDE Support.

Proactively Monitoring Edge Memory Utilization

There are three ways to proactively monitor Edge memory utilization:

On the Orchestrator UI, memory usage can be monitored by going to the Monitor > Edge > System tab.
If allowed by configuration, SNMP can provide the information using the .iso.org.dod.internet.private.enterprises.velocloud.modules.edge.vceEdgeObject.vceHealth.vceHealthObject.vceMemUsedPct OID.
The same value can also be obtained by calling the API metrics/getEdgeStatusMetrics method.

The memory utilization values obtained via any of these methods do not reflect the utilization of the edged process alone, but rather indicate system-wide memory usage.
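For the API method, a request could be assembled along the following lines. This sketch only builds the request and does not send it. Apart from the metrics/getEdgeStatusMetrics method name, every detail is an assumption to verify against the Orchestrator API documentation for the release in use: the URL prefix, the token header format, the body field names, and the memoryPct metric name. The host, token, and IDs are placeholders.

```python
import json
import urllib.request


def build_metrics_request(orchestrator_host, api_token, enterprise_id, edge_id, start_ms):
    """Build (but do not send) a request for Edge memory metrics.

    ASSUMPTION: the URL prefix, header format, and body fields below are
    illustrative and must be checked against the Orchestrator API reference.
    """
    url = f"https://{orchestrator_host}/portal/rest/metrics/getEdgeStatusMetrics"
    body = {
        "enterpriseId": enterprise_id,
        "edgeId": edge_id,
        "interval": {"start": start_ms},  # assumed: epoch milliseconds
        "metrics": ["memoryPct"],         # assumed metric name
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {api_token}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder values; nothing is sent over the network here.
request = build_metrics_request("vco.example.com", "YOUR_TOKEN", 1, 2, 0)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) and decoding the JSON response is left out, since the response layout is not described in this article.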