The purpose of this article is to explain the current design and provide a workaround for customers who are experiencing this issue.
Symptoms:
Site-to-site connectivity is broken for a specific set of subnets.
The remote subnets to which traffic fails can be subnets directly connected to the destination device.
This is observed in setups where traffic is expected to fall back from one SD-WAN site to another site that advertises a less specific prefix.
Orchestrator Remote Diagnostics - Route Table Dump shows the reachability of the impacted route as FALSE.
Consider the setup below.
Route 172.16.3.0/29 is advertised to the overlay as a connected subnet behind b3-edge1
Route 172.16.3.0/28 is advertised to the overlay as a static subnet behind b2-edge1
b1-edge1 learns these two routes from the overlay. Since the setup has two gateways, each prefix appears twice, once per gateway.
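For reference, the overlap between the two prefixes can be checked with a few lines of standard Python (illustrative only, not part of the SD-WAN product): both prefixes contain the host 172.16.3.3 used later in this article, and the /29 behind b3-edge1 is the longer, more specific match.

import ipaddress

host = ipaddress.ip_address("172.16.3.3")
specific = ipaddress.ip_network("172.16.3.0/29")       # connected subnet behind b3-edge1
less_specific = ipaddress.ip_network("172.16.3.0/28")  # static subnet behind b2-edge1

# Both prefixes cover the host, so a route lookup has two candidates.
print(host in specific, host in less_specific)   # True True
# The /29 sits inside the /28 and therefore wins a longest-prefix match.
print(specific.subnet_of(less_specific))         # True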
CLI output (for on-premises customers who manage their own VCEs):
edge:b1-edge1(active):~# debug.py --routes | grep -Ei "Address|172.16.3.0"
Address     Netmask          Type       Gateway  Next Hop Name  Next Hop ID                           Destination Name  Dst LogicalId                         Reachable  Metric  Preference  Flags  Vlan   Intf  Sub IntfId  MTU   SEG
172.16.3.0  255.255.255.248  edge2edge  any      gateway-1      d2fe9e58-3deb-48c6-94c8-0c2154c36eb4  b3-edge1          fd6b6ead-f7a0-4d6c-864d-6907747a1f25  True       0       0           SR     0      any   N/A         1500  0
172.16.3.0  255.255.255.248  edge2edge  any      gateway-2      327aad1c-1cfe-4a6e-83a5-3c6d85a3918d  b3-edge1          fd6b6ead-f7a0-4d6c-864d-6907747a1f25  True       0       0           SR     0      any   N/A         1500  0
172.16.3.0  255.255.255.240  edge2edge  any      gateway-1      d2fe9e58-3deb-48c6-94c8-0c2154c36eb4  b2-edge1          fac9c633-bee2-4fca-b633-f2b70a7c1a34  True       0       0           SR     65535  any   N/A         1500  0
172.16.3.0  255.255.255.240  edge2edge  any      gateway-2      327aad1c-1cfe-4a6e-83a5-3c6d85a3918d  b2-edge1          fac9c633-bee2-4fca-b633-f2b70a7c1a34  True       0       0           SR     65535  any   N/A         1500  0
When the remote edge goes offline, is powered off, or is deactivated, its route turns "FALSE".
Traffic initiated by clients behind b1-edge1 is dropped on b1-edge1 because it hits the "FALSE" route, even though the less specific prefix 172.16.3.0/28 learnt via b2-edge1 is still in the TRUE state.
Output on b1-edge1
edge:b1-edge1(active):~# debug.py --routes | grep -Ei "Address|172.16.3.0"
Address     Netmask          Type       Gateway  Next Hop Name  Next Hop ID                           Destination Name  Dst LogicalId                         Reachable  Metric  Preference  Flags  Vlan   Intf  Sub IntfId  MTU   SEG
172.16.3.0  255.255.255.248  edge2edge  any      gateway-1      d2fe9e58-3deb-48c6-94c8-0c2154c36eb4  b3-edge1          fd6b6ead-f7a0-4d6c-864d-6907747a1f25  False      0       0           SR     0      any   N/A         1500  0
172.16.3.0  255.255.255.248  edge2edge  any      gateway-2      327aad1c-1cfe-4a6e-83a5-3c6d85a3918d  b3-edge1          fd6b6ead-f7a0-4d6c-864d-6907747a1f25  False      0       0           SR     0      any   N/A         1500  0
172.16.3.0  255.255.255.240  edge2edge  any      gateway-1      d2fe9e58-3deb-48c6-94c8-0c2154c36eb4  b2-edge1          fac9c633-bee2-4fca-b633-f2b70a7c1a34  True       0       0           SR     65535  any   N/A         1500  0
172.16.3.0  255.255.255.240  edge2edge  any      gateway-2      327aad1c-1cfe-4a6e-83a5-3c6d85a3918d  b2-edge1          fac9c633-bee2-4fca-b633-f2b70a7c1a34  True       0       0           SR     65535  any   N/A         1500  0
ICMP traffic is initiated from a client behind b1-edge1 towards 172.16.3.3, an address covered by the FALSE /29 route.
On the CLI, the drops below are seen when a packet tracker is configured; the drop reason is "ipv4_route_lookup_fail".
edge:b1-edge1(active):~# debug.py --pkt_tracker 10.0.1.25 any 172.16.3.3 any 1 20
{'count': 20, 'sip': '10.0.1.25', 'proto': '1', 'dport': 'any', 'debug': 'pkt_track', 'sport': 'any', 'dip': '172.16.3.3'}
{ "Success": "Logging of packets started" }
2023-03-10T01:09:39.764 INFO [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 17, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:40.764 INFO [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 16, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:41.764 INFO [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 15, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:42.764 INFO [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 14, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
This is observed in setups where the next hop points to either a Gateway or a Hub.
However, when either the source edge or the remote edge is a Hub, and the "Branch to Branch" configuration under "Cloud VPN" goes via the Hubs, this issue is not observed, because the FALSE route on the source edge is deleted once the path goes down.
When the remote subnet is a connected subnet, and the interface it is connected to is physically down or the remote edge is turned off, the route is marked FALSE on all the other SD-WAN Edges that receive it.
This route is never deleted from the source edge's routing table, so traffic initiated behind this edge matches the FALSE route and is dropped on the source edge itself.
Although a less specific route exists in the TRUE state, traffic does not fall back to it, because route backtracking is not currently enabled by default.
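The behaviour can be illustrated with a small, conceptual Python sketch (an illustration of longest-prefix matching with and without backtracking, not the Edge's actual lookup code). Without backtracking, the unreachable /29 still wins the lookup and the packet is dropped; with backtracking, the lookup falls back to the reachable /28 via b2-edge1.

import ipaddress

# Simplified view of the two overlay routes from the example above.
routes = [
    {"prefix": ipaddress.ip_network("172.16.3.0/29"), "via": "b3-edge1", "reachable": False},
    {"prefix": ipaddress.ip_network("172.16.3.0/28"), "via": "b2-edge1", "reachable": True},
]

def lookup(dst, backtrack=False):
    dst = ipaddress.ip_address(dst)
    # Candidate routes sorted most-specific first (longest prefix wins).
    candidates = sorted((r for r in routes if dst in r["prefix"]),
                        key=lambda r: r["prefix"].prefixlen, reverse=True)
    if not candidates:
        return "drop: no route"
    if not backtrack:
        best = candidates[0]  # the longest match is used regardless of its state
        return ("forward via " + best["via"]) if best["reachable"] else "drop: ipv4_route_lookup_fail"
    for r in candidates:      # with backtracking, unreachable routes are skipped
        if r["reachable"]:
            return "forward via " + r["via"]
    return "drop: no reachable route"

print(lookup("172.16.3.3"))                  # drop: ipv4_route_lookup_fail (current default)
print(lookup("172.16.3.3", backtrack=True))  # forward via b2-edge1 (with backtracking enabled)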
A knob to enable backtracking has been added in the new VCO UI -> Global Settings -> Customer Configuration.
This feature is available from version 5.2.0.0 onwards.
If the SD-WAN Edge (VCE) and the VCO are not yet on the resolved version 5.2.0.0, a restart of the device is needed to delete the routes marked as FALSE.
To restart the unit, from the Orchestrator go to Remote Actions - select the device - Restart Service.
A restart of the VCE impacts production traffic. Please apply the workaround during non-business hours.