Edge Nodes lost connectivity to all BGP peers during a single TOR switch upgrade in a redundant switch setup.


Article ID: 435876


Updated On:

Products

VMware vSphere ESXi
VMware NSX

Issue/Introduction

  • During a scheduled upgrade of a single Top-of-Rack (TOR1) switch, Edge nodes lost connectivity to all BGP peers across both TOR1 and the redundant TOR2 switch.
  • Each switch has its own dedicated BGP peers configured, and it was expected that only TOR1's peers would go down while it was being upgraded.
  • Per the environment architecture, the redundant switch (TOR2) was expected to keep network traffic flowing through its own BGP peers.
  • However, all upstream connectivity failed, including the BGP peers on both switches, resulting in a complete loss of VM network connectivity via the Tier-0 and VRF gateways.
  • Example:
    • TOR1 switch BGP peers:
      • 192.168.1.254/24
      • 192.168.2.254/24
      • 192.168.3.254/24
    • TOR2 Switch BGP peers:
      • 192.168.6.254/24
      • 192.168.7.254/24
      • 192.168.8.254/24
  • While the TOR1 switch was being upgraded, loss of connectivity to its peers (192.168.1.254, 192.168.2.254, 192.168.3.254) was expected; however, connectivity also failed to the TOR2 peers (192.168.6.254, 192.168.7.254, 192.168.8.254).
  • The Edge node reported all of the BGP peers as down:

YYYY-MM-DDTHH:MM:SS.722Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.1.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.331Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.2.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.779Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.3.254 State:DOWN
....
YYYY-MM-DDTHH:MM:SS.498Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.6.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.814Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.7.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.097Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.8.254 State:DOWN
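
  • The same peer states can be confirmed directly from the Edge node CLI. A minimal sketch, assuming SSH access to the affected Edge node as admin; <VRF-ID> is the VRF number of the Tier-0 or VRF gateway service router, taken from the output of the first command:

get logical-routers
vrf <VRF-ID>
get bgp neighbor summary

  • The first command lists the service router instances and their VRF numbers, the second enters the VRF context, and the summary normally shows every peer as Established; during the impact window, all six example peers report Down.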

  • A review of the ESXi host logs (/var/run/log/vobd.log) confirms that physical adapter link states flapped only for the adapters connected to the TOR1 switch during the upgrade window; see the filter sketch after the log excerpt.

YYYY-MM-DDTHH:MM:SS.758Z In(14) vobd[2098003]:  [netCorrelator] 982132523547us: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
YYYY-MM-DDTHH:MM:SS.755Z In(14) vobd[2098003]:  [netCorrelator] 982988518424us: [vob.net.vmnic.linkstate.up] vmnic vmnic# linkstate up
YYYY-MM-DDTHH:MM:SS.564Z In(14) vobd[2098003]:  [netCorrelator] 983112326637us: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
YYYY-MM-DDTHH:MM:SS.457Z In(14) vobd[2098003]:  [netCorrelator] 983216219199us: [vob.net.vmnic.linkstate.up] vmnic vmnic# linkstate up
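
  • To isolate these transitions, the vobd log can be filtered on each host. A minimal sketch, assuming ESXi shell access:

# List only the vmnic link-state up/down events
grep -E 'vob.net.vmnic.linkstate.(up|down)' /var/run/log/vobd.log

  • In this case only the vmnics cabled to TOR1 appeared in the output; link-state events for the TOR2-facing vmnics would have pointed to a wider physical failure.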

  • The Edge VMs are deployed on a vSAN cluster, and during the upgrade window the ESXi host logs reported every host in the cluster going into vSAN isolation.

/var/run/log/clomd.log:

YYYY-MM-DDTHH:MM:SS.679Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.929Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
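
  • The cluster membership during the window can be cross-checked from each host. A minimal sketch, assuming ESXi shell access:

# Show the local vSAN node state and the current member count
esxcli vsan cluster get

  • During the isolation window the Sub-Cluster Member Count drops as hosts fall out of the cluster; once the network stabilizes, it should return to the full host count.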

  • vSAN networking was configured to use adapters connected to both the TOR1 and TOR2 switches; hence, even with the TOR1-facing adapters down, connectivity via TOR2 should have remained up.
  • However, during the window a logical reset was performed on all adapters across the ESXi hosts in the cluster, including those connected to both the TOR1 and TOR2 switches.

/var/log/vmkernel.log:

YYYY-MM-DDTHH:MM:SS.293Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
....
YYYY-MM-DDTHH:MM:SS.351Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.352Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
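
  • The scope of these resets can be quantified by counting the reenabling events per adapter. A minimal sketch, assuming ESXi shell access:

# Count DEVICE_REENABLING set/clear events per vmnic
grep DEVICE_REENABLING /var/log/vmkernel.log | grep -o 'vmnic[0-9]*' | sort | uniq -c

  • Every vmnic on the host appearing in this output, including the TOR2-facing ones, confirms that the resets were not limited to the switch being upgraded.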

  • The vSAN isolation occurred because the logical reset spanned all adapters. This caused the Edge VMs to behave erratically, moving all BGP peers to the Down state and impacting VM connectivity.

Environment

VMware NSX
VMware vSphere ESXi

Cause

The upstream physical switch triggered a logical reset of all physical network adapters on all ESXi hosts, including the uplinks to both redundant switches.

This simultaneous reset of all uplinks caused temporary vSAN isolation across the cluster, leading to the Edge VMs dropping all active BGP sessions, including those routed through the redundant TOR switches.

Host logs confirm that DEVICE_REENABLING flags were set and cleared for all vmnics simultaneously, even though the adapters connected to the redundant switch never reported a link-state down event.
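
The simultaneity can be checked by comparing the timestamp of the first reset event on each host. A minimal sketch, assuming ESXi shell access on every host in the cluster:

# Print the first DEVICE_REENABLING event on this host; compare the timestamp across hosts
grep -m1 DEVICE_REENABLING /var/log/vmkernel.log

If the first-event timestamps align across all hosts to within a few seconds, the reset was cluster-wide and externally triggered rather than host-local.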

Resolution

Engage the physical networking and server teams to further validate the events reported during the switch upgrade window.

The switch-side events reported during the window should help identify any failure or fault that resulted in a reset being performed across the interfaces of both redundant switches.
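
Once the physical side has been reviewed, the current adapter states can be verified from each host before the next maintenance window. A minimal sketch, assuming ESXi shell access:

# Confirm every vmnic reports link up at the expected speed and duplex
esxcli network nic list

All vmnics connected to the redundant switches should show a link status of Up before proceeding with further switch upgrades.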