Routes are not being learned on secondary site ESXi host(s) UDLR instances

Article ID: 330281

Products

VMware NSX

Issue/Introduction

Symptoms:
  • You are running NSX-V in a Cross-vCenter environment.
  • East-West traffic is not affected.
  • The issue affects newly added hosts located in the secondary site, and only the routes that are listed in the UDLR routing table.
 
  • Check the routes on the UDLR
    1. Run show ip route from the console of the DLR control VM. The routes should be listed.
  • Check the routes on the ESXi host 
    1. Use net-vdr -l -I to list the VDR name required for the next command
[root@HQ-ESXi-Prod-01a:~] net-vdr -l -I
VDR Instance Information :
---------------------------
Vdr Name:                   edge-1
 
    2. List the routes for the DLR instance running on the host with net-vdr --route -l edge-XXX, where edge-XXX is the VDR name returned by the previous command. This lists the routes available on that host.
[root@HQ-ESXi-Prod-01a:~] net-vdr --route -l edge-1

VDR edge-1 Route Table
Legend: [U: Up], [G: Gateway], [C: Connected], [I: Interface]
Legend: [H: Host], [B: Blackhole], [F: Soft Flush] [!: Reject] [E: ECMP]

Destination      GenMask          Gateway          Flags    Ref Origin   UpTime     Interface        HitCount
-----------      -------          -------          -----    --- ------   ------     ---------        --------
192.168.120.0    255.255.255.0    192.168.250.251  UG       1   AUTO     952989     138800000004     3
192.168.200.0    255.255.255.0    0.0.0.0          UCI      1   MANUAL   1116451    13880000000a     1
  • On the problematic host, the routes are not updated. This happens because the host is not learning the BGP routes.
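To compare the route table of a working host with that of a problematic host, it can help to reduce the net-vdr output to one line per route. The following is a minimal sketch, not an official tool: the function name vdr_routes is hypothetical, and it assumes the column layout matches the sample output above (destination, netmask and gateway in the first three columns).

```shell
# Read `net-vdr --route -l <vdr-name>` output on stdin and print
# "destination netmask via gateway" for each route row.
# Route rows are recognised by a dotted-quad address in the first column,
# so the banner, legend and header lines are skipped.
vdr_routes() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $1 " " $2 " via " $3 }'
}

# Example use on a host (edge-1 is the VDR name from the sample output):
#   net-vdr --route -l edge-1 | vdr_routes
```

Saving this output from a working host and from the problematic host makes the missing routes easy to spot with a plain text comparison.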
 
  • On a problematic host, the host locale ID is all zeros.
    1. Use net-vdr -C -l on the host to list the host locale ID.
Host locale Id: 00000000-0000-0000-0000-000000000000

NOTE: The preceding log excerpts are only examples. Date, time and environmental variables may vary depending on your environment.
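The locale ID check can also be scripted. This is a sketch under the assumption that the output contains a line in the form "Host locale Id: <value>", as in the excerpt above. The function name locale_id_is_zero is hypothetical.

```shell
# Read `net-vdr -C -l` output on stdin.
# Returns 0 if the host locale ID is all zeros, 1 otherwise
# (including when no "Host locale Id:" line is found).
locale_id_is_zero() {
    id=$(sed -n 's/^Host locale Id:[[:space:]]*//p' | head -n 1)
    [ "$id" = "00000000-0000-0000-0000-000000000000" ]
}

# Example use on a host:
#   if net-vdr -C -l | locale_id_is_zero; then
#       echo "Host locale ID is not set"
#   fi
```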

Cause

This issue occurs when the config-by-vsm.xml file on the host has a file size of 0 KB.
  • To confirm, browse to the /etc/vmware/netcpa directory on the affected host and list its contents:
[root@HQ-ESXi-Prod-01a:/etc/vmware/netcpa] ls -l
total 16
-rw-r--r--    1 root     root           0 Sep  2 14:52 config-by-vsm.xml
-rw-r--r-T    1 root     root          4228 Aug  2  2020 netcpa.xml
-rw-r--r--    1 root     root           545 Aug 18 11:09 tunable.xml
  • Note that config-by-vsm.xml is listed with a size of 0.
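The same check can be made without reading the directory listing by eye. The following sketch uses the standard shell file tests. The function name vsm_config_is_empty is hypothetical, and the default path is the one shown in this article.

```shell
# Return 0 if the file exists and has a size of zero bytes, 1 otherwise.
# With no argument, the path from this article is checked.
vsm_config_is_empty() {
    f=${1:-/etc/vmware/netcpa/config-by-vsm.xml}
    [ -f "$f" ] && [ ! -s "$f" ]
}

# Example use on a host:
#   if vsm_config_is_empty; then
#       echo "config-by-vsm.xml is empty on this host"
#   fi
```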
  When the Verbose logging level is enabled for netcpa, the following can be seen:
  • Searching for "_configNode is NULL" in /var/log/netcpa.log returns entries such as the following, which indicate that the file is empty:
2020-11-26T10:43:46.801Z [ 34BAA700 verbose ] localeId JSON is "421D5A9D-65BB-5B71-E3B0-DB6FB6E0F45A"
2020-11-26T10:43:46.801Z [ 34BAA700 verbose ] _configNode is NULL
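The log search described above could be scripted as follows. This is a sketch only: the function name count_confignode_null is hypothetical, and the message appears only when verbose logging is enabled for netcpa.

```shell
# Print the number of lines in a netcpa log that contain the
# "_configNode is NULL" message. The log path defaults to the
# location referenced in this article.
count_confignode_null() {
    grep -c -F '_configNode is NULL' "${1:-/var/log/netcpa.log}"
}

# Example use on a host:
#   count_confignode_null
```

A count greater than zero is consistent with an empty config-by-vsm.xml, but the file size check remains the direct confirmation.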

Resolution

This issue is resolved in NSX-V 6.4.11.

Workaround:
Remove the affected ESXi host from the NSX cluster and add it back.