In scaled environments, where a virtual service is placed on multiple Service Engines and is handling a high number of connections, a virtual service flap (vs_down/vs_up) caused by a network issue or health monitor failure at a peak in open connections can trigger a surge in Tx/Rx pps from flow probes between Service Engines. This surge can overwhelm the data plane NIC queues, leading to BGP peer flaps, further health monitor failures, and a snowball effect.
Example of a Virtual Service with high number of connections:
The Service Engine hosting this VS shows a correlating high number of tx/rx pps.
The flow_probes_req_sent and flow_probes_req_received counters on the Service Engine interfaces will increment at a high rate and reach very high values.
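As a rough way to quantify the surge, two samples of one of these counters can be turned into a rate. The `probe_rate` helper below is a hypothetical convenience for manual analysis, not an Avi CLI command; feed it two readings of `flow_probes_req_sent` taken a known number of seconds apart.

```shell
# Convert two counter samples taken DT seconds apart into a rough
# packets-per-second rate. Purely illustrative; the readings come
# from the SE interface statistics you collected manually.
probe_rate() {  # usage: probe_rate OLD_VALUE NEW_VALUE DT_SECONDS
  awk -v old="$1" -v new="$2" -v dt="$3" \
    'BEGIN { printf "%.0f pps\n", (new - old) / dt }'
}
```

For example, `probe_rate 1000 61000 10` reports `6000 pps`.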
In the Service Engine BGP logs, you will find that BGP is continuously being marked down between the Service Engine and its BGP peer.
$ grep -E 'NOTIFICATION|BGP_DOWN' /var/lib/avi/log/bgp/avi_ns2_bgpd.log | grep 2025 | head -n 6
2025/03/14 20:27:41 BGP: %NOTIFICATION: sent to neighbor x.x.x.x 4/0 (Hold Timer Expired) 0 bytes
2025/03/14 20:27:41 BGP: BGP Adapter: BGP_DOWN for peer x.x.x.x (vrf: default) new refcount 0
2025/03/14 20:28:38 BGP: %NOTIFICATION: sent to neighbor x.x.x.x 4/0 (Hold Timer Expired) 0 bytes
2025/03/14 20:28:38 BGP: BGP Adapter: BGP_DOWN for peer x.x.x.x (vrf: default) new refcount 0
2025/03/14 20:29:27 BGP: %NOTIFICATION: sent to neighbor x.x.x.x 4/0 (Hold Timer Expired) 0 bytes
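To gauge how often the peer is flapping, the BGP_DOWN entries can be bucketed per minute. A small sketch over the same log file shown above (the function name is illustrative):

```shell
# Count BGP_DOWN events per minute; a steadily repeating count across
# consecutive minutes indicates continuous peer flapping.
# Usage: flap_summary < /var/lib/avi/log/bgp/avi_ns2_bgpd.log
flap_summary() {
  grep 'BGP_DOWN' | awk '{print $1, substr($2, 1, 5)}' | sort | uniq -c
}
```

Each output line is a count followed by the date and the minute in which the flaps occurred.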
Affects all versions with BGP/BFD configurations.
This issue is triggered by a flap (vs_down) of a virtual service during peak connection load. The flap itself may be caused by a network issue or service pressure in the backend.
The flap causes Avi to withdraw the VIP route from BGP, which rehashes the existing flows onto other SEs and produces a surge of flow probes between the Service Engines hosting the virtual service.
A major contributing factor to this issue is aggressive BGP and BFD timers.
Example:
BFD : mintx = minrx = 500 ms, detect-multiplier = 3
BGP : Keep-Alive Time = 1 Sec, Hold-Time = 5 Sec
***Note***:
Please adjust these values per your environment requirements.
Please configure more relaxed BGP and BFD timers.
Example:
BGP : Keep-Alive Time = 5 Sec, Hold-Time = 15 Sec
BFD : mintx = minrx = 999 ms, detect-multiplier = 6
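The difference between the two timer sets can be seen in the resulting failure-detection windows. As a rough sketch, assuming the standard BFD rule that detection time is the detect-multiplier times the negotiated interval (per RFC 5880):

```shell
# Compare BFD failure-detection windows for the aggressive vs. relaxed
# timer examples above (detection time ~= detect-multiplier x interval).
awk 'BEGIN {
  printf "aggressive: 3 x 500 ms = %.1f s\n", 3 * 500 / 1000
  printf "relaxed:    6 x 999 ms = %.1f s\n", 6 * 999 / 1000
}'
```

The relaxed timers give the data plane roughly four times longer to absorb a transient flow-probe surge before BFD declares the peer down.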
It is also recommended to configure alerts for VS_DOWN events on these scaled-out, high-connection virtual services and, where possible, to create rate-limiting rules to handle an unexpected surge in traffic.
More information on configuring event-based alerts and notifications, as well as rate limiters, can be found in the following documentation:
Workaround(s):
Disable flow probes temporarily to stabilize the system while the state of the virtual service recovers from a flapping vs_down/vs_up issue.
CLI commands to disable flow probes
> configure serviceenginegroup SEG_NAME
> disable_flow_probes
> save
CLI commands to enable flow probes
> configure serviceenginegroup SEG_NAME
> no disable_flow_probes
> save
Below you will find information on how to gather the flow table state to determine whether the flow probe issue is occurring.
Please focus on the flow_probes_rsp_received parameter, as it confirms the existence of the flow. Once the flow is confirmed, a punt entry is created on the Service Engine.
---
GET /api/serviceengine/SE_UUID/flowtablestat
https://<controller-ip>/api/serviceengine/SE_UUID/flowtablestat
POST /api/serviceengine/SE_UUID/flowtablestat/clear
https://<controller-ip>/api/serviceengine/SE_UUID/flowtablestat/clear
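As a sketch of how these endpoints are typically called: the controller address, SE UUID, credentials, and `X-Avi-Version` header below are placeholders you must replace, and basic auth is an assumption about your setup.

```shell
# Build the flowtablestat URLs for one Service Engine. The curl calls
# are commented out because they require a live controller and real
# credentials; X-Avi-Version and basic auth are assumptions.
CTRL="controller.example.com"   # hypothetical controller address
SE_UUID="se-xxxxxxxx"           # replace with the real SE UUID
GET_URL="https://${CTRL}/api/serviceengine/${SE_UUID}/flowtablestat"
CLEAR_URL="https://${CTRL}/api/serviceengine/${SE_UUID}/flowtablestat/clear"
# curl -sk -u admin:PASSWORD -H "X-Avi-Version: <version>" "$GET_URL"
# curl -sk -u admin:PASSWORD -H "X-Avi-Version: <version>" -X POST "$CLEAR_URL"
echo "$GET_URL"
echo "$CLEAR_URL"
```

Sample the GET endpoint twice, some seconds apart, to confirm whether flow_probes_rsp_received is still climbing before and after disabling flow probes.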
**NOTE**: