A "Memory Usage Warning" event is generated by the VMware SD-WAN Edge when the Resource Monitor process detects that Edge memory utilization has exceeded defined thresholds. If utilization then reaches the critical threshold and stays there, the Edge Service is restarted.
Once the event is created, it is communicated to VMware SD-WAN Orchestrator as an alert.
The Edge uses a Resource Monitor process that periodically calculates the Edge's available memory. The calculation takes the total Edge RAM and subtracts any reserved memory (e.g. security VNF allocation if applicable, etc.) and then updates this difference as the available Edge memory. The remaining memory is then shared between several SD-WAN supporting processes.
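The calculation described above can be sketched as follows. This is a minimal illustration, not the Resource Monitor's actual code: the function names and the figures in the usage note are hypothetical, and the assumption that utilization is expressed as a percentage of the available memory comes from the threshold descriptions in this article.

```python
from typing import Iterable


def available_memory_mb(total_ram_mb: float, reserved_mb: Iterable[float]) -> float:
    """Total Edge RAM minus every reserved allocation (for example a security VNF)."""
    available = total_ram_mb - sum(reserved_mb)
    if available <= 0:
        raise ValueError("reserved memory must be smaller than total RAM")
    return available


def usage_pct(process_mb: float, available_mb: float) -> float:
    """Process memory usage as a percentage of the available Edge memory."""
    return 100.0 * process_mb / available_mb
```

For example, an Edge with 4096 MB of RAM and 1024 MB reserved would have 3072 MB available, and a process using 1536 MB of that would be at 50% utilization.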
To avoid an “out of memory” condition for the most critical Edge process, edged, thresholds are defined that alert on rising memory usage. By default, the Resource Monitor checks memory consumption every 5 seconds. As long as the current value is within the specified memory thresholds, the Resource Monitor only logs an internal sample every 15 minutes.
If memory utilization is 40 - 59% of the available memory, a warning event EDGE_MEMORY_USAGE_WARNING is sent to the Orchestrator. This event will be sent every 60 minutes until the memory usage drops under the 40% threshold.
Event: Memory Usage Warning
Severity: Warning
Event Detail: Process edged memory usage (## MB) exceeds ##% threshold
If the memory utilization reaches 60%, the Resource Monitor waits for 90 seconds to allow the edged process to recover from a possible temporary spike in memory usage. If memory usage persists at a 60% or higher level for more than 90 seconds, the Edge will generate the error message EDGE_MEMORY_USAGE_ERROR and send this to the Orchestrator. The Orchestrator will translate this into the Memory Usage Critical event and post this to the Orchestrator's Events page. When a Memory Usage Critical event is detected, the Edge's Resource Monitor restarts the edged process to clear the Edge's memory (this is also referred to as an Edge Service restart). Restarting the edged process results in a 15-30 second disruption of customer traffic.
Event: Memory Usage Critical
Severity: Error
Event Detail: Process edged memory usage (## MB) exceeds 60% threshold
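The warning and critical behavior described above can be modeled as a small state machine. The sketch below is only an illustration of the documented thresholds and timers, not the Edge's implementation. The class and field names are hypothetical, and two behaviors are assumptions: no event is emitted during the 90-second grace period, and the hourly warning timer resets once usage drops under 40%.

```python
from dataclasses import dataclass
from typing import Optional

WARN_PCT = 40.0          # warning threshold from this article
CRIT_PCT = 60.0          # critical threshold from this article
CRIT_HOLD_S = 90         # usage must persist for more than 90 seconds
WARN_REPEAT_S = 60 * 60  # warning is repeated every 60 minutes


@dataclass
class MemoryMonitor:
    crit_since: Optional[float] = None  # when usage first reached CRIT_PCT
    last_warn: Optional[float] = None   # when the last warning was sent

    def sample(self, now: float, pct: float) -> Optional[str]:
        """Process one sample; return an event name or None."""
        if pct >= CRIT_PCT:
            if self.crit_since is None:
                self.crit_since = now
            if now - self.crit_since > CRIT_HOLD_S:
                # Sustained critical usage: the service restart clears state.
                self.crit_since = None
                self.last_warn = None
                return "EDGE_MEMORY_USAGE_ERROR"
            return None  # assumption: stay silent during the grace period
        self.crit_since = None  # the spike recovered before 90 seconds
        if pct >= WARN_PCT:
            if self.last_warn is None or now - self.last_warn >= WARN_REPEAT_S:
                self.last_warn = now
                return "EDGE_MEMORY_USAGE_WARNING"
            return None
        self.last_warn = None  # assumption: timer resets under 40%
        return None
```

Feeding this model one sample every 5 seconds at 65% usage yields no event until the sample taken 95 seconds after the first, which is the first one more than 90 seconds into the critical range.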
Note: The Edge service memory usage thresholds changed in Q1 of 2021. In earlier Edge releases, the threshold that triggered alerts was 50% and the level that triggered an Edge service restart was 70%. After the change, the thresholds are 40% to trigger a warning alert and 60% for an Edge service restart, once that level has been sustained for more than 90 seconds.
The change to 40% Alert / 60% Restart memory thresholds begins with each of the following Edge software release trains:
An Edge using software that is earlier than the listed software in that release train (for example, 3.4.5, 4.0.2, or 4.2.1) would use the older 50% / 70% thresholds.
One of the most common reasons for high memory usage under normal conditions is an excessive number of flows for the affected Edge model. The flow count can be verified from the Monitor > Edge > System tab. The Edge flow capacity may be confirmed by consulting the latest Edge data sheet.
If the flow count does indeed exceed the Edge model specifications there are two things to consider:
If the flow count is well within recommended limits for this Edge model per the data sheet, there is the rarer possibility the Edge is suffering a memory leak. The risk of an Edge memory leak increases with the age of the Edge Software release the Edge is using.
One sign of a potential memory leak is that the "Memory Usage Warning" events delivered to the Orchestrator every hour show slowly increasing memory usage from one event to the next, with the increase continuing over days or even weeks. It is important to ensure that the affected Edge is using the latest VMware SD-WAN Edge software release, as the latest build includes the fixes for all Edge memory leaks resolved up to that release date.
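A rough way to check for this pattern is to extract the MB figure from each hourly event and test whether the series keeps rising. The sketch below assumes the Event Detail text follows the format shown earlier in this article ("Process edged memory usage (## MB) exceeds ##% threshold"). The function names and the `min_events` default of 24 are hypothetical choices, not values from VMware.

```python
import re
from typing import Sequence

# Matches the Event Detail format shown in this article.
DETAIL_RE = re.compile(r"memory usage \((\d+(?:\.\d+)?) MB\) exceeds (\d+)% threshold")


def parse_usage_mb(detail: str) -> float:
    """Return the MB figure from one Event Detail string."""
    match = DETAIL_RE.search(detail)
    if match is None:
        raise ValueError(f"unrecognized event detail: {detail!r}")
    return float(match.group(1))


def steadily_rising(usages_mb: Sequence[float], min_events: int = 24) -> bool:
    """True if usage never drops between events and ends higher than it began."""
    if len(usages_mb) < min_events:
        return False  # too few events to call it a trend
    never_drops = all(b >= a for a, b in zip(usages_mb, usages_mb[1:]))
    return never_drops and usages_mb[-1] > usages_mb[0]
```

A strict "never drops" test is deliberately conservative. A real leak may show small dips between events, so a positive result is a reason to investigate further and a negative result does not rule a leak out.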
If the Edge is indeed suffering a memory leak and an upgrade either still needs to be scheduled or cannot be scheduled for some time, the workaround to prevent an unexpected traffic disruption is to schedule a maintenance window in which an administrator can restart the Edge service and clear the memory. An Edge Service restart may be done on the Orchestrator by going to Remote Actions > Restart Service. Customer traffic would be disrupted for 15-30 seconds when executing this action. Restarts would be scheduled based on how quickly the memory usage is increasing over time, the idea being to schedule the restart before usage reaches 65%.
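To decide when to schedule the restart, the observed growth can be extrapolated linearly. This is a simple sketch under the assumption that usage grows at a roughly constant rate. The function name and sample values are hypothetical, and the target level is a parameter so that the caller can apply the level appropriate to their Edge release.

```python
from typing import Optional, Sequence, Tuple


def hours_until(target_pct: float, samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours until usage reaches target_pct.

    samples holds (hour, usage_pct) pairs, oldest first. Returns None
    when usage is flat or falling, so no crossing can be projected.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    rate = (p1 - p0) / (t1 - t0)  # percentage points per hour
    if rate <= 0:
        return None
    return max(0.0, (target_pct - p1) / rate)
```

For instance, usage that rose from 42% to 48% over 48 hours is growing at 0.125 percentage points per hour, so a 60% level would be projected about 96 hours after the last sample.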
If the Edge is running the latest software and there are still memory warnings with flow counts well within Edge specifications, please engage the VMware SD-WAN Support Team as outlined here: Contact VMware by Broadcom SDE Support.
There are three ways to proactively monitor Edge memory utilization:
On the Orchestrator UI, memory usage can be monitored by going to the Monitor > Edge > System tab.
If allowed by configuration, SNMP can provide the information using the .iso.org.dod.internet.private.enterprises.velocloud.modules.edge.vceEdgeObject.vceHealth.vceHealthObject.vceMemUsedPct OID.
The same value can also be obtained by calling the API metrics/getEdgeStatusMetrics method.
The memory utilization values obtained via any of these methods are not the utilization of just the edged process, but rather an indication of the system-wide memory usage.