No route on the active DLR Control VM after a HA Failover
search cancel

No route on the active DLR Control VM after a HA Failover

book

Article ID: 327332

calendar_today

Updated On:

Products

VMware NSX

Issue/Introduction

  • An outage occurs due to the default route missing on the DLR after a HA event.
  • Entries similar to the below indicating a routing socket error observed in vshield_edge_msr_logs on DLR Control VM support bundle.

    Sample logs in vshield_edge_msr_logs
    **** PROBLEM 0x0309 - 6 (0000) **** I:00002157 F:00000001
    i3lx.c 421 :at 01:10:10, 22 November 2021 (517481444 ms)
    Interface Information stub failed to process a routing message because
    a recv() call on a routing socket failed.
    LSR Index = 1
    Recv errno = 88

    **** PROBLEM 0x0309 - 6 (0000) **** I:00002465 F:00000001
    i3lx.c 421 :at 02:06:12, 22 November 2021 (520843804 ms)
    Interface Information stub failed to process a routing message because
    a recv() call on a routing socket failed.
    LSR Index = 1
    Recv errno
  • VMCI channel flaps observed in /var/run/log/netcpa.log on the ESXi host.

    Sample logs in /var/run/log/netcpa.log
    Host A
    T21:03:46.994Z [ 7BBE700 error ] recv error: 0:Success
    T21:03:46.994Z [ 7BBE700 info ] Vdrb: vmci link down, fd = 27


    Host B
    T21:03:55.064Z [ D6B82700 info ] Vdrb: vmci link up, fd = 24
    T21:03:55.064Z [ D6B82700 info ] Sent edge link up to kernel


Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Environment

VMware NSX Data Center for vSphere

Resolution


This is a known issue impacting VMware NSX Data Center for vSphere.
 

Workaround
 
In order to identify the root cause of this issue, please gather the below information while the issue is present:
 
  • Log bundles:

    1. NSX Manager Logs.
    2. Controller Logs.
    3. Edge Logs.
    4. Log bundle for ESXi hosts the Edge Nodes resided on.
    5. Log bundle for the ESXi hosts an impacted VM resided on.

  • Collect the ipstrc.log file:
    1. On the Edge CLI, run “debug routing” as the admin user.
    2. Wait 60 seconds.
    3. Collect /var/log/msr/ipstrc.log
    4. Disable log collection by running “no debug routing” on the the Edge CLI.

  • Then follow the below instructions:
    1. One more config push from the Management Plane (for instance toggle BGP enable/disable)
    2. Reboot the impacted node, this will trigger a failover.