No route on the active DLR Edge after the HA Failover

Article ID: 327332

Products

VMware NSX

Issue/Introduction

Symptoms:
An outage occurs because the default route is missing on the DLR after an HA event.

The vmci channel flaps and the msr logs show routing socket errors, as in the excerpts below:
# vmci channel flaps:
ESXi 139
T21:03:46.994Z [ 7BBE700 error ] recv error: 0:Success
T21:03:46.994Z [ 7BBE700 info ] Vdrb: vmci link down, fd = 27
ESXi 138
T21:03:55.064Z [ D6B82700 info ] Vdrb: vmci link up, fd = 24
T21:03:55.064Z [ D6B82700 info ] Sent edge link up to kernel
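To check a larger log for the same pattern, the link state changes can be pulled out with a small script. The following is only a sketch: the function name `find_vmci_flaps` is hypothetical, and the regular expression assumes the line layout seen in the excerpt above, which may differ in other builds.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   T21:03:46.994Z [ 7BBE700 info ] Vdrb: vmci link down, fd = 27
VMCI_LINK = re.compile(
    r"T(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})Z\s+"
    r"\[\s*(?P<thread>\w+)\s+(?P<level>\w+)\s*\]\s+"
    r"Vdrb: vmci link (?P<state>up|down), fd = (?P<fd>\d+)"
)


def find_vmci_flaps(lines):
    """Return (time, state, fd) for every vmci link state change found."""
    events = []
    for line in lines:
        match = VMCI_LINK.search(line)
        if match:
            events.append(
                (match.group("time"), match.group("state"), int(match.group("fd")))
            )
    return events


# Sample lines copied from the excerpt above.
sample = [
    "T21:03:46.994Z [ 7BBE700 error ] recv error: 0:Success",
    "T21:03:46.994Z [ 7BBE700 info ] Vdrb: vmci link down, fd = 27",
    "T21:03:55.064Z [ D6B82700 info ] Vdrb: vmci link up, fd = 24",
    "T21:03:55.064Z [ D6B82700 info ] Sent edge link up to kernel",
]

for event in find_vmci_flaps(sample):
    print(event)
```

A "down" event on one host followed shortly by an "up" event on another, as in the sample, is consistent with the HA failover described in this article.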

# routing socket errors:
**** PROBLEM 0x0309 - 6 (0000) **** I:00002157 F:00000001
i3lx.c 421 :at 01:10:10, 22 November 2021 (517481444 ms)
Interface Information stub failed to process a routing message because
a recv() call on a routing socket failed.
LSR Index = 1
Recv errno = 88
**** PROBLEM 0x0309 - 6 (0000) **** I:00002465 F:00000001
i3lx.c 421 :at 02:06:12, 22 November 2021 (520843804 ms)
Interface Information stub failed to process a routing message because
a recv() call on a routing socket failed.
LSR Index = 1
Recv errno
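On Linux, errno 88 is ENOTSOCK ("Socket operation on non-socket"). Assuming the control VM uses Linux errno numbering, the sketch below counts the PROBLEM entries in an msr log and lists the Recv errno values that are present. The function name `summarize_recv_errors` and the one-entry lookup table are illustrative only, and an entry with no errno value, like the truncated one above, is counted but contributes no value.

```python
import re

# Patterns assume the entry layout shown in the excerpt above.
PROBLEM = re.compile(r"\*{4} PROBLEM (0x[0-9A-Fa-f]+)")
RECV_ERRNO = re.compile(r"Recv errno\s*=\s*(\d+)")

# Hard-coded so the result does not depend on the platform running the script.
# Only the value seen in this article is listed (Linux numbering assumed).
LINUX_ERRNO_NAMES = {88: "ENOTSOCK (Socket operation on non-socket)"}


def summarize_recv_errors(text):
    """Return (number of PROBLEM entries, list of Recv errno values found)."""
    problem_count = len(PROBLEM.findall(text))
    errnos = [int(value) for value in RECV_ERRNO.findall(text)]
    return problem_count, errnos


# Sample text copied from the excerpt above, including the truncated last line.
sample = """\
**** PROBLEM 0x0309 - 6 (0000) **** I:00002157 F:00000001
i3lx.c 421 :at 01:10:10, 22 November 2021 (517481444 ms)
Interface Information stub failed to process a routing message because
a recv() call on a routing socket failed.
LSR Index = 1
Recv errno = 88
**** PROBLEM 0x0309 - 6 (0000) **** I:00002465 F:00000001
i3lx.c 421 :at 02:06:12, 22 November 2021 (520843804 ms)
Interface Information stub failed to process a routing message because
a recv() call on a routing socket failed.
LSR Index = 1
Recv errno
"""

count, errnos = summarize_recv_errors(sample)
print("PROBLEM entries:", count)
for value in errnos:
    print(value, LINUX_ERRNO_NAMES.get(value, "unknown"))
```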


Cause

The best information currently available is that the control VM was unable to read netlink messages for a few hours. If no changes were made to the system that could have led to this, the system possibly built up to this condition over a period of time.

Resolution



Workaround:
When the problem occurs, and before applying the workaround, collect ipstrc.log:
1) In the CLI, run “debug routing” as admin.
2) Wait about a minute.
3) Collect /var/log/msr/ipstrc.log.
4) Disable log collection with “no debug routing”.

WORKAROUND
1) Trigger one more configuration push from the management plane (for example, toggle BGP enable/disable).
2) Reboot.
3) Perform a failover.