Avi SE BGP flaps caused by a high surge/spike in TX/RX pps on the Service Engine data plane interfaces

Article ID: 392744

Updated On:

Products

VMware Avi Load Balancer

Issue/Introduction

In scaled environments where virtual services are scaled out across multiple Service Engines and are processing a high number of connections, a virtual service flap (vs_down/vs_up) caused by a network issue or health monitor failure at a time of peak open connections can trigger a surge in TX/RX pps from flow probes between Service Engines. This surge overloads the data plane NIC queues, leading to BGP peer flaps and further health monitor failures, creating a snowball effect.

Example of a Virtual Service with a high number of connections:

Service Engine hosting this VS showing a correspondingly high number of TX/RX pps:

The flow_probes_req_sent and flow_probes_req_received counters on the Service Engine interfaces will be incrementing at a high rate and will show very high values.

In the Service Engine BGP logs you will find that BGP is continuously being marked down between the Service Engine and the BGP peer.

$ grep -E 'NOTIFICATION|BGP_DOWN' /var/lib/avi/log/bgp/avi_ns2_bgpd.log | grep 2025 | head -n 6 
2025/03/14 20:27:41 BGP: %NOTIFICATION: sent to neighbor x.x.x.x 4/0 (Hold Timer Expired) 0 bytes
2025/03/14 20:27:41 BGP: BGP Adapter: BGP_DOWN for peer x.x.x.x (vrf: default) new refcount 0
2025/03/14 20:28:38 BGP: %NOTIFICATION: sent to neighbor x.x.x.x 4/0 (Hold Timer Expired) 0 bytes
2025/03/14 20:28:38 BGP: BGP Adapter: BGP_DOWN for peer x.x.x.x (vrf: default) new refcount 0
2025/03/14 20:29:27 BGP: %NOTIFICATION: sent to neighbor x.x.xx 4/0 (Hold Timer Expired) 0 bytes

Environment

Affects all versions with BGP/BFD configurations.

Cause

This issue was caused by a flap (vs_down) of a virtual service at a time of peak connection load. The flap could have been due to network pressure or a service issue in the backend.

This caused Avi to withdraw the VIP route from BGP, which led to rehashing of the existing flows onto other SEs, resulting in a surge of flow probes between the Service Engines hosting the virtual services.

A major contributing factor to this issue is setting aggressive BGP and BFD timers.

Example:

BFD : mintx = minrx = 500 ms, detect-multiplier = 3
BGP : Keep-Alive Time = 1 Sec, Hold-Time = 5 Sec

***Note***:

Please adjust these values per your environment requirements.

Resolution

Please configure more relaxed BGP and BFD timers.

Example:

BGP : Keep-Alive Time = 5 Sec, Hold-Time = 15 Sec
BFD : mintx = minrx = 999 ms, detect-multiplier = 6
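A minimal Avi CLI sketch of applying these relaxed timers, assuming the BGP timers are set under the VRF context's bgp_profile (keepalive_interval / hold_time) and the BFD timers under its BFD profile (mintx / minrx / multi); verify the exact object paths and field names for your Avi version before applying:

# Sketch only - object paths and field names are assumptions, confirm for your version
> configure vrfcontext global
> bgp_profile
> keepalive_interval 5
> hold_time 15
> save
> bfd_profile
> mintx 999
> minrx 999
> multi 6
> save
> save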

It is also recommended to configure alerts for VS_DOWN events on these scaled-out, high-connection virtual services and, if possible, to create rate-limiting rules to handle an unexpected surge in traffic (an example alert sketch follows the documentation links below).

You can find more information on how to configure event-based alerts and notifications, as well as details on how to configure rate limiters, in the following documentation:

Alerts Overview

Rate Limiters
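As an illustrative sketch, an event-based alert for VS_DOWN could be created through the Controller REST API with curl, as shown below. The AlertConfig field names (alert_rule.sys_event_rule, action_group_ref) and the System-Alert-Level-High action group are assumptions based on the standard object model; validate them against the Alerts Overview documentation for your version:

# Sketch only - AlertConfig field names and the action group are assumptions
curl -sk -u admin:'<password>' \
  -H "Content-Type: application/json" \
  -H "X-Avi-Version: <controller-version>" \
  -X POST "https://<controller-ip>/api/alertconfig" \
  -d '{
        "name": "VS-Down-Alert",
        "source": "EVENT_LOGS",
        "category": "REALTIME",
        "enabled": true,
        "alert_rule": { "sys_event_rule": [ { "event_id": "VS_DOWN" } ] },
        "action_group_ref": "/api/actiongroupconfig?name=System-Alert-Level-High",
        "summary": "Virtual Service went down"
      }'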

Workaround(s):

Disable flow probes temporarily to stabilize the system while the virtual service recovers from the vs_down/vs_up flapping issue.

CLI commands to disable flow probes

> configure serviceenginegroup SEG_NAME
> disable_flow_probes
> save



CLI commands to enable flow probes

> configure serviceenginegroup SEG_NAME
> no disable_flow_probes
> save

Additional Information

Below is information on how to gather the flow table state to determine whether the flow probe issue is occurring.

Please focus on the flow_probes_rsp_received parameter, as it will confirm the existence of the flow. Once confirmed, the punt entry will be created on the Service Engine.

---

  • To fetch the flowtable statistics for a specific Service Engine (SE), use the following API endpoint:

    GET /api/serviceengine/<SE_UUID>/flowtablestat
    https://<controller-ip>/api/serviceengine/<SE_UUID>/flowtablestat


  • In this API response, focus on the flow_probes_rsp_received metric, which tracks the number of flow probe response packets received by the SE.
  • Begin by establishing a baseline for the flow_probes_rsp_received metric across all Service Engines (SEs) that are associated with Virtual Services (VS).
  • You can clear the flowtablestat of the SEs to establish a baseline.

    POST /api/serviceengine/<SE_UUID>/flowtablestat/clear
    https://<controller-ip>/api/serviceengine/<SE_UUID>/flowtablestat/clear


  • This will give you a reference point for the normal behavior in a steady state.
  • Under normal operation, you should expect this metric to remain very low, as rehashing should not be happening frequently in a stable environment.
  • Monitor the changes in the flow_probes_rsp_received values over time and track the delta between the current value and the baseline value (see the monitoring sketch after this list).
  • In normal circumstances, the delta should be minimal (less than 5), indicating that there are no significant changes in the flow probe responses.
  • If the delta reaches 100 or more, or exceeds another predefined threshold (e.g., a significant spike), it indicates an anomaly that may require further investigation.
  • Action: Disable the flow probes.
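The following is a minimal shell sketch of this monitoring loop using curl and jq against the flowtablestat endpoint described above. The credentials, the X-Avi-Version header value, the 60-second polling interval, the 100-probe threshold, and the exact placement of flow_probes_rsp_received within the JSON response are assumptions to adapt to your environment:

#!/bin/bash
# Sketch only - placeholders and the response layout are assumptions
CTRL="https://<controller-ip>"
SE="<SE_UUID>"
AUTH="admin:<password>"
VER="X-Avi-Version: <controller-version>"
THRESHOLD=100

get_probes() {
  # Sum every flow_probes_rsp_received field found anywhere in the response,
  # since the statistic may be reported per dispatcher/core.
  curl -sk -u "$AUTH" -H "$VER" "$CTRL/api/serviceengine/$SE/flowtablestat" \
    | jq '[.. | .flow_probes_rsp_received? | numbers] | add // 0'
}

# Optionally clear the stats first to start from a clean baseline:
# curl -sk -u "$AUTH" -H "$VER" -X POST "$CTRL/api/serviceengine/$SE/flowtablestat/clear"

baseline=$(get_probes)
while sleep 60; do
  current=$(get_probes)
  delta=$((current - baseline))
  echo "$(date -u +%FT%TZ) flow_probes_rsp_received delta=$delta"
  if [ "$delta" -ge "$THRESHOLD" ]; then
    echo "Anomalous flow probe activity detected; consider disabling flow probes." >&2
  fi
  baseline=$current
done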

**NOTE**:

  • Once the flow probe recovery process completes (for instance, flow rehashing), all connection traffic will be routed through the SEs as expected, meaning that traffic will again flow between SEs per the updated flow table.
  • When disabling flow probes, it is crucial to adjust the TCP idle timeout in the network profile to a low value to ensure optimal resource utilization and prevent stale connections from lingering unnecessarily. The default value is 10 minutes (600 seconds). A CLI sketch follows below.
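As a hedged illustration, lowering the TCP idle timeout on the applicable network profile might look like the following in the Avi CLI; the System-TCP-Proxy profile name and the profile > tcp_proxy_profile > idle_connection_timeout path are assumptions to verify against your configuration, and the value should be reverted once flow probes are re-enabled:

# Sketch only - profile name and nested field path are assumptions
> configure networkprofile System-TCP-Proxy
> profile
> tcp_proxy_profile
> idle_connection_timeout 60
> save
> save
> save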