Site-to-site connectivity issues for remote connected subnets in VMware SD-WAN

Article ID: 344866

Updated On:

Products

VMware

Issue/Introduction

This article explains the current route-handling design that causes site-to-site traffic to be dropped when a more specific overlay route becomes unreachable, and provides a workaround for affected customers.


Symptoms:

Site-to-site connectivity is broken for a specific set of subnets.

The affected remote subnets are typically connected subnets of the destination Edge.

This is observed in setups where you would expect traffic to fall back from one SD-WAN site to another site that advertises a less specific prefix.

In the Orchestrator, Remote Diagnostics - Route Table Dump shows the reachability of the impacted route as FALSE.

Consider the following setup.

Route 172.16.3.0/29 is advertised to the overlay as a connected subnet behind b3-edge1 
Route 172.16.3.0/28 is advertised to the overlay as a static subnet behind b2-edge1
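The relationship between the two prefixes can be checked with a minimal Python sketch using the standard `ipaddress` module. It is only an illustration of prefix containment and longest-prefix preference; the variable names are hypothetical, and the host 172.16.3.3 is the destination used later in this article.

```python
import ipaddress

specific = ipaddress.ip_network("172.16.3.0/29")  # connected subnet behind b3-edge1
covering = ipaddress.ip_network("172.16.3.0/28")  # static subnet behind b2-edge1
host = ipaddress.ip_address("172.16.3.3")         # test destination

# The /29 is fully contained in the /28, so both routes match the host.
overlaps = specific.subnet_of(covering)
matches = [net for net in (specific, covering) if host in net]

# Longest-prefix match prefers the route with the larger prefix length.
preferred = max(matches, key=lambda net: net.prefixlen)
print(overlaps, [str(net) for net in matches], str(preferred))
```

Because the /29 is the more specific match, traffic to 172.16.3.3 normally follows the route via b3-edge1 rather than the /28 via b2-edge1.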



b1-edge1 learns both routes from the overlay. Because this setup has two Gateways, each prefix appears twice in the route table, once per Gateway.


CLI output (for on-premises customers who manage their own VCEs):

edge:b1-edge1(active):~# debug.py --routes | grep -Ei "Address|172.16.3.0"
Address                 Netmask        Type       Gateway  Next Hop Name                           Next Hop ID  Destination Name                         Dst LogicalId  Reachable  Metric  Preference  Flags   Vlan           Intf  Sub IntfId   MTU  SEG
172.16.3.0      255.255.255.248   edge2edge           any      gateway-1  d2fe9e58-3deb-48c6-94c8-0c2154c36eb4          b3-edge1  fd6b6ead-f7a0-4d6c-864d-6907747a1f25       True       0           0     SR      0            any         N/A  1500    0
172.16.3.0      255.255.255.248   edge2edge           any      gateway-2  327aad1c-1cfe-4a6e-83a5-3c6d85a3918d          b3-edge1  fd6b6ead-f7a0-4d6c-864d-6907747a1f25       True       0           0     SR      0            any         N/A  1500    0
172.16.3.0      255.255.255.240   edge2edge           any      gateway-1  d2fe9e58-3deb-48c6-94c8-0c2154c36eb4          b2-edge1  fac9c633-bee2-4fca-b633-f2b70a7c1a34       True       0           0     SR  65535            any         N/A  1500    0
172.16.3.0      255.255.255.240   edge2edge           any      gateway-2  327aad1c-1cfe-4a6e-83a5-3c6d85a3918d          b2-edge1  fac9c633-bee2-4fca-b633-f2b70a7c1a34       True       0           0     SR  65535            any         N/A  1500    0



When the remote Edge goes offline, is powered off, or is deactivated, the reachability of its route changes to "FALSE".

Traffic initiated from clients behind b1-edge1 is dropped on b1-edge1 because it hits the “FALSE” route, even though the less specific prefix 172.16.3.0/28 is still learnt via b2-edge1.



Output on b1-edge1


edge:b1-edge1(active):~# debug.py --routes | grep -Ei "Address|172.16.3.0"
Address                 Netmask        Type       Gateway  Next Hop Name                           Next Hop ID  Destination Name                         Dst LogicalId  Reachable  Metric  Preference  Flags   Vlan           Intf  Sub IntfId   MTU  SEG
172.16.3.0      255.255.255.248   edge2edge           any      gateway-1  d2fe9e58-3deb-48c6-94c8-0c2154c36eb4          b3-edge1  fd6b6ead-f7a0-4d6c-864d-6907747a1f25      False       0           0     SR      0            any         N/A  1500    0
172.16.3.0      255.255.255.248   edge2edge           any      gateway-2  327aad1c-1cfe-4a6e-83a5-3c6d85a3918d          b3-edge1  fd6b6ead-f7a0-4d6c-864d-6907747a1f25      False       0           0     SR      0            any         N/A  1500    0
172.16.3.0      255.255.255.240   edge2edge           any      gateway-1  d2fe9e58-3deb-48c6-94c8-0c2154c36eb4          b2-edge1  fac9c633-bee2-4fca-b633-f2b70a7c1a34       True       0           0     SR  65535            any         N/A  1500    0
172.16.3.0      255.255.255.240   edge2edge           any      gateway-2  327aad1c-1cfe-4a6e-83a5-3c6d85a3918d          b2-edge1  fac9c633-bee2-4fca-b633-f2b70a7c1a34       True       0           0     SR  65535            any         N/A  1500    0
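For on-premises deployments, output like the one above can be scanned for this condition with a short script. The following is a minimal sketch, not a supported tool: it assumes each route row splits into exactly 17 whitespace-separated fields as in the sample shown here (which would not hold if a name contained spaces), and the function names `parse_routes` and `shadowed_fallbacks` are hypothetical.

```python
import ipaddress

# Two rows taken from the output above, with column padding collapsed.
SAMPLE = """\
172.16.3.0 255.255.255.248 edge2edge any gateway-1 d2fe9e58-3deb-48c6-94c8-0c2154c36eb4 b3-edge1 fd6b6ead-f7a0-4d6c-864d-6907747a1f25 False 0 0 SR 0 any N/A 1500 0
172.16.3.0 255.255.255.240 edge2edge any gateway-1 d2fe9e58-3deb-48c6-94c8-0c2154c36eb4 b2-edge1 fac9c633-bee2-4fca-b633-f2b70a7c1a34 True 0 0 SR 65535 any N/A 1500 0
"""

def parse_routes(text):
    """Return (network, destination name, reachable) tuples from route rows."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 17-field route row.
        if len(fields) != 17 or fields[0] == "Address":
            continue
        network = ipaddress.ip_network(f"{fields[0]}/{fields[1]}")
        routes.append((network, fields[6], fields[8] == "True"))
    return routes

def shadowed_fallbacks(routes):
    """Pair each unreachable prefix with reachable, less specific prefixes covering it."""
    pairs = set()
    for dead_net, dead_dst, dead_ok in routes:
        if dead_ok:
            continue
        for live_net, live_dst, live_ok in routes:
            if (live_ok and live_net.prefixlen < dead_net.prefixlen
                    and dead_net.subnet_of(live_net)):
                pairs.add((str(dead_net), dead_dst, str(live_net), live_dst))
    return sorted(pairs)

result = shadowed_fallbacks(parse_routes(SAMPLE))
for dead_net, dead_dst, live_net, live_dst in result:
    print(f"{dead_net} via {dead_dst} is unreachable; "
          f"{live_net} via {live_dst} is reachable")
```

Each reported pair is a prefix for which traffic may be dropped even though a reachable, less specific route exists.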




ICMP traffic was initiated from a client (10.0.1.25) behind b1-edge1 towards 172.16.3.3.
With a packet tracker configured on the CLI, the following drops are logged:

edge:b1-edge1(active):~# debug.py --pkt_tracker 10.0.1.25 any 172.16.3.3 any 1 20
{'count': 20, 'sip': '10.0.1.25', 'proto': '1', 'dport': 'any', 'debug': 'pkt_track', 'sport': 'any', 'dip': '172.16.3.3'}
{
  "Success": "Logging of packets started"
}



2023-03-10T01:09:39.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 17, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:40.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 16, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:41.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 15, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:42.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 14, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
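When many tracker lines are collected, they can be summarized per destination and drop reason. The sketch below is an illustration only: the regular expression is derived from the log lines shown above and assumes the `dst=` and `reason "..."` fields always appear in that form; the names `LOG`, `PATTERN`, and `counts` are hypothetical.

```python
import re
from collections import Counter

# Two lines copied from the packet tracker output above.
LOG = '''\
2023-03-10T01:09:39.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 17, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
2023-03-10T01:09:40.764 INFO    [NET] vc_pkt_print_track:203 proto=1, src=10.0.1.25:15165, dst=172.16.3.3:0, tos=0, reason "ipv4_route_lookup_fail", count 16, path "29:pkt_path_ipv4_read 3 28 29 50 51 52 63 96"
'''

# Capture the destination IP and the quoted drop reason from each line.
PATTERN = re.compile(r'dst=(?P<dst>[\d.]+):\d+,.*?reason "(?P<reason>[^"]+)"')

counts = Counter()
for line in LOG.splitlines():
    match = PATTERN.search(line)
    if match:
        counts[(match.group("dst"), match.group("reason"))] += 1

for (dst, reason), seen in counts.items():
    print(f"{dst}: {reason} x{seen}")
```

A tally dominated by `ipv4_route_lookup_fail` for a destination inside the unreachable prefix is consistent with the symptom described in this article.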





This is observed in setups where the next hop points to either a Gateway or a Hub.


However, when either the source Edge or the remote Edge is a Hub and "Branch to Branch" under "Cloud VPN" is configured to go via the Hubs, the issue is not observed, because the FALSE route is deleted from the source Edge once the path goes down.


Environment

VMware SD-WAN

Cause

When the remote subnet is a connected subnet and the interface connecting it is physically down, or the remote Edge is turned off, the route is marked FALSE on all other SD-WAN Edges that receive it.

The route is never deleted from the source Edge's routing table. As a result, traffic initiated behind that Edge matches the FALSE route and is dropped on the source Edge itself.

Although a less specific route exists in the TRUE state, traffic does not fall back to it.

This is because route backtracking is not currently enabled by default.
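The behavior described above can be pictured with a small conceptual model. This is a sketch of the idea only, not the Edge's actual lookup code: the `lookup` function, its `backtrack` parameter, and the `ROUTES` list are hypothetical names, and the route entries are taken from the example in this article.

```python
import ipaddress

# (prefix, destination Edge, reachable) as seen on b1-edge1 after b3-edge1 goes down.
ROUTES = [
    (ipaddress.ip_network("172.16.3.0/29"), "b3-edge1", False),
    (ipaddress.ip_network("172.16.3.0/28"), "b2-edge1", True),
]

def lookup(dst, routes, backtrack):
    """Return the destination Edge for dst, or None if the packet is dropped."""
    addr = ipaddress.ip_address(dst)
    # Order the matching routes from most specific to least specific.
    matches = sorted(
        (route for route in routes if addr in route[0]),
        key=lambda route: route[0].prefixlen,
        reverse=True,
    )
    for _network, edge, reachable in matches:
        if reachable:
            return edge
        if not backtrack:
            # Without backtracking, the lookup stops at the unreachable best match.
            return None
    return None

print(lookup("172.16.3.3", ROUTES, backtrack=False))
print(lookup("172.16.3.3", ROUTES, backtrack=True))
```

In this model, a lookup without backtracking stops at the unreachable /29 and the packet is dropped, while a lookup with backtracking continues to the reachable /28 via b2-edge1.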



 

Resolution

Route backtracking can be enabled with a setting in the new VCO UI under Global Settings -> Customer configuration.

This feature is available from version 5.2.0.0 onwards.



Workaround:

If the SD-WAN Edge (VCE) and the VCO are not on the resolved version 5.2.0.0, the device must be restarted to delete the routes marked as FALSE.

To restart the unit, go to Remote Actions in the Orchestrator, select the device, and click Restart Service.


Additional Information

Impact/Risks:

Restarting the VCE disrupts production traffic. Apply the workaround outside business hours.