Inter VRF routes remain post removal of inter-vrf config.


Article ID: 375950



Products

VMware NSX Networking

Issue/Introduction

  • Automated scripts are used in the environment to remove (tear down) VRFs and their inter-VRF routing configuration.
  • After a VRF and its inter-VRF configuration have been deleted, invalid stale routes remain on the parent T0. These routes were system generated by the inter-VRF routing function, which was added in NSX 4.1 (see the NSX 4.1 Administration Guide, Inter VRF Routing). This issue does not impact inter-VRF route leaking.
  • Traffic that matches these stale routes may be interrupted intermittently.
  • The stale routes can be observed as follows:

    • In the CSV of routes downloaded from the T0 via the GUI, the routes appear as in the example below, which contains one valid and one stale route. The ivs route type denotes an inter-VRF static route. The subnet is 192.168.1.0/24 and the next hops are 169.254.2.8 (invalid) and 169.254.2.10 (valid). The VRF UUID is missing from the invalid route because the VRF has been deleted.

"/infra/sites/default/enforcement-points/default/edge-clusters/<EDGE CLUSTER ID>/edge-nodes/0",ivs,192.168.1.0/24,,169.254.2.8,5,"012abcde-54ab-1234-5678-abcdef123456","CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0",,false   <<< Invalid route of the removed VRF, with an invalid next hop IP and the VRF UUID missing

"/infra/sites/default/enforcement-points/default/edge-clusters/<EDGE CLUSTER ID>/edge-nodes/0",ivs,192.168.1.0/24,,169.254.2.10,5,"012abcde-54ab-1234-5678-abcdef123456","CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0","/infra/tier-0s/<VRF UUID>",false

    • When checking the T0 routes via the API or in the logs, next_hop_gateway is missing from the stale entries.

"route_type": "ivs",
"network": "192.168.1.0/24",
"admin_distance": 5,
"next_hop": "169.254.2.10",
"lr_component_id": "<LR UUID>",
"lr_component_type": "CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0",
"next_hop_gateway": "/infra/tier-0s/<VRF UUID>",   <<< next_hop_gateway is present on the valid route of the active VRF
"black_hole": false

"route_type": "ivs",
"network": "192.168.1.0/24",
"admin_distance": 5,
"next_hop": "169.254.2.8",
"lr_component_id": "<LR UUID>",
"lr_component_type": "CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0",   <<< next_hop_gateway is not present on the invalid route
"black_hole": false

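The check described above can be sketched in Python. This is a minimal illustration, assuming the route entries have already been loaded into a list of dictionaries that use the field names shown in the example. The function name find_stale_ivs_routes is hypothetical and not part of any NSX tooling, and the sample data simply repeats the two entries above.

```python
from typing import Any, Dict, List


def find_stale_ivs_routes(routes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return inter-VRF static ("ivs") routes that have no next_hop_gateway."""
    return [
        route
        for route in routes
        # A stale entry is an ivs route whose next_hop_gateway is absent or empty.
        if route.get("route_type") == "ivs" and not route.get("next_hop_gateway")
    ]


# Sample data copied from the example entries (placeholders left as-is).
routes = [
    {
        "route_type": "ivs",
        "network": "192.168.1.0/24",
        "admin_distance": 5,
        "next_hop": "169.254.2.10",
        "lr_component_id": "<LR UUID>",
        "lr_component_type": "CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0",
        "next_hop_gateway": "/infra/tier-0s/<VRF UUID>",
        "black_hole": False,
    },
    {
        "route_type": "ivs",
        "network": "192.168.1.0/24",
        "admin_distance": 5,
        "next_hop": "169.254.2.8",
        "lr_component_id": "<LR UUID>",
        "lr_component_type": "CCP_ROUTER_TYPE_SERVICE_ROUTER_TIER0",
        "black_hole": False,
    },
]

for route in find_stale_ivs_routes(routes):
    print("Possible stale route:", route["network"], "via", route["next_hop"])
```

A route flagged this way is only a candidate. Confirm it against the deleted VRF before raising it with support.
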
Environment

  • NSX 4.1.x and 4.2.0
  • Scripted automation (such as Terraform) is used to configure and remove VRFs and inter-VRF routing.

Cause

The issue is caused by a race condition between delete requests and the realisation of those deletes. If a VRF with inter-VRF routing is deleted immediately after its inter-VRF routing configuration is removed, the VRF can be deleted before the inter-VRF configuration attached to it has been fully removed. The routes generated from that configuration then remain on the parent T0, pointing to a VRF that no longer exists. This happens because scripted automation triggers the API removal calls faster than they can be fully processed and realised.

Because this is a timing issue, it occurs only when these actions are carried out in very quick succession, normally by a scripted removal process such as Terraform or Ansible.

Resolution

This issue will be fixed in upcoming NSX releases.

Workaround:

  • To prevent the issue, allow a small delay (1-2 seconds) between removing the inter-VRF routing configuration and deleting the VRF it references. The scripting tool you are using may be able to add this delay between the two steps. Alternatively, split the steps into two separate scripted actions and pause after the inter-VRF removal before triggering the VRF deletion.
  • Once a stale route is present, it cannot be removed via the API or GUI. In this case, open a case with the Global Support team to investigate and remove the routes. When opening the case, include the NSX management cluster log bundle and details of the stale route.
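
The ordering in the first workaround can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not an NSX or Terraform API: the function remove_vrf_with_delay and its parameters are hypothetical, and the two callables stand in for whatever API calls or tool steps your automation uses to remove the inter-VRF routing configuration and to delete the VRF.

```python
import time
from typing import Callable


def remove_vrf_with_delay(
    delete_inter_vrf_config: Callable[[], None],
    delete_vrf: Callable[[], None],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Remove the inter-VRF config, pause, then delete the VRF (hypothetical helper)."""
    # Step 1: remove the inter-VRF routing configuration first.
    delete_inter_vrf_config()
    # Step 2: pause so the removal can be realised before the VRF is deleted.
    sleep(delay_seconds)
    # Step 3: only now delete the VRF that the configuration referenced.
    delete_vrf()


# Dry run with stand-in callables; the recording sleep avoids a real pause here.
events = []
remove_vrf_with_delay(
    delete_inter_vrf_config=lambda: events.append("inter-vrf config removed"),
    delete_vrf=lambda: events.append("vrf deleted"),
    delay_seconds=2.0,
    sleep=lambda seconds: events.append(f"paused {seconds}s"),
)
print(events)
```

In real automation the default time.sleep would be used, and the two callables would wrap the actual removal calls. The 2-second default reflects the upper end of the 1-2 second delay suggested above and may need tuning for your environment.
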