TOR1 switch BGP peers:
192.168.1.254/24
192.168.2.254/24
192.168.3.254/24

TOR2 switch BGP peers:
192.168.6.254/24
192.168.7.254/24
192.168.8.254/24

BGP peering to 192.168.1.254, 192.168.2.254, and 192.168.3.254 will be lost as expected; however, connectivity also failed to the BGP peers on TOR2: 192.168.6.254, 192.168.7.254, and 192.168.8.254. The NSX Edge node logs show BGP State Update DOWN events for all six peers:

YYYY-MM-DDTHH:MM:SS.722Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.1.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.331Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.2.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.779Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.3.254 State:DOWN
....
YYYY-MM-DDTHH:MM:SS.498Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.6.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.814Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.7.254 State:DOWN
YYYY-MM-DDTHH:MM:SS.097Z <Edge Node> NSX 10679 ROUTING [nsx@6876 comp="nsx-edge" subcomp="rcpm" s2comp="bgp-adapter" level="INFO"] BGP State Update - VRF:<VRF-ID> DST:192.168.8.254 State:DOWN
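For triage, the state transitions can be pulled out of a saved copy of the Edge syslog programmatically. The following is a minimal Python sketch, assuming the log excerpt has been saved locally as edge-syslog.log (a hypothetical filename); the regex simply follows the "BGP State Update" line format shown above:

import re

# Minimal sketch: extract BGP peer state transitions from a local copy
# of an NSX Edge syslog excerpt. "edge-syslog.log" is a hypothetical
# filename; the regex follows the "BGP State Update" lines shown above.
BGP_RE = re.compile(
    r"BGP State Update - VRF:(?P<vrf>\S+) DST:(?P<peer>\S+) State:(?P<state>\S+)"
)

def bgp_transitions(path):
    """Yield (timestamp, peer, state) for each BGP state-change line."""
    with open(path) as log:
        for line in log:
            m = BGP_RE.search(line)
            if m:
                ts = line.split()[0]  # ISO timestamp is the first field
                yield ts, m.group("peer"), m.group("state")

# Example: list every peer that went DOWN, in order.
for ts, peer, state in bgp_transitions("edge-syslog.log"):
    if state == "DOWN":
        print(f"{ts}  peer {peer} went DOWN")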
The vobd log (/var/run/log/vobd.log) from the ESXi hosts confirms physical adapter link states flapping only for the adapters connected to the TOR1 switch during the upgrade window:

YYYY-MM-DDTHH:MM:SS.758Z In(14) vobd[2098003]: [netCorrelator] 982132523547us: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
YYYY-MM-DDTHH:MM:SS.755Z In(14) vobd[2098003]: [netCorrelator] 982988518424us: [vob.net.vmnic.linkstate.up] vmnic vmnic# linkstate up
YYYY-MM-DDTHH:MM:SS.564Z In(14) vobd[2098003]: [netCorrelator] 983112326637us: [vob.net.vmnic.linkstate.down] vmnic vmnic# linkstate down
YYYY-MM-DDTHH:MM:SS.457Z In(14) vobd[2098003]: [netCorrelator] 983216219199us: [vob.net.vmnic.linkstate.up] vmnic vmnic# linkstate up
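The same approach can tally link-down events per adapter, to confirm that only the TOR1-facing vmnics flapped. A minimal sketch, assuming a local copy saved as vobd.log (a hypothetical filename):

import re
from collections import Counter

# Minimal sketch: count link-down events per vmnic in a local copy of
# /var/run/log/vobd.log. Matches the vob.net.vmnic.linkstate.{up,down}
# events shown above; "vobd.log" as a local filename is an assumption.
LINK_RE = re.compile(
    r"\[vob\.net\.vmnic\.linkstate\.(?P<state>up|down)\] vmnic (?P<nic>\S+)"
)

flaps = Counter()
with open("vobd.log") as log:
    for line in log:
        m = LINK_RE.search(line)
        if m and m.group("state") == "down":
            flaps[m.group("nic")] += 1

for nic, count in flaps.most_common():
    print(f"{nic}: {count} link-down event(s)")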
/var/run/log/clomd.log shows cluster nodes being removed from CLOMDB, consistent with temporary vSAN isolation:
YYYY-MM-DDTHH:MM:SS.679Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.429Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
YYYY-MM-DDTHH:MM:SS.929Z No(29) clomd[2099646]: [Originator@6876] clomdb-CdbHandleRemoveEntry: Removing <ESXi host vSAN UUID> of type CdbObjectNode from CLOMDB.
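If useful, the removals can be listed per node from a saved copy of the log. A minimal sketch (clomd.log as a local filename is an assumption; the anonymized <ESXi host vSAN UUID> placeholders above would be real UUIDs in an actual log):

import re

# Minimal sketch: list the vSAN node UUIDs removed from CLOMDB in a
# local copy of clomd.log. Each removal corresponds to a host dropping
# out of the vSAN cluster membership during the isolation.
REMOVE_RE = re.compile(
    r"CdbHandleRemoveEntry: Removing (?P<uuid>\S+) of type CdbObjectNode"
)

with open("clomd.log") as log:
    for line in log:
        m = REMOVE_RE.search(line)
        if m:
            ts = line.split()[0]  # ISO timestamp is the first field
            print(f"{ts}  removed vSAN node {m.group('uuid')}")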
/var/log/vmkernel.log shows DEVICE_REENABLING flags being set and cleared repeatedly for the vmnics:
YYYY-MM-DDTHH:MM:SS.293Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.294Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
....
YYYY-MM-DDTHH:MM:SS.351Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18074: vmnic#: set flags 0x49e0e DEVICE_REENABLING
YYYY-MM-DDTHH:MM:SS.352Z In(182) vmkernel: cpu102:2102063 opID=fb635ae6)Uplink: 18212: vmnic#: clear flags 0x41e0e DEVICE_REENABLING
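To show that every vmnic was reset inside the same narrow window, the set/clear events can be extracted and summarized from a saved copy of the log. A minimal sketch, assuming a local file named vmkernel.log (a hypothetical filename):

import re

# Minimal sketch: pull DEVICE_REENABLING set/clear events from a local
# copy of vmkernel.log and report how many vmnics were reset and over
# what time window. "vmkernel.log" as a local filename is an assumption.
REENABLE_RE = re.compile(
    r"Uplink: \d+: (?P<nic>vmnic\S+?): (?P<action>set|clear) flags \S+ DEVICE_REENABLING"
)

events = []
with open("vmkernel.log") as log:
    for line in log:
        m = REENABLE_RE.search(line)
        if m:
            events.append((line.split()[0], m.group("nic"), m.group("action")))

if events:
    nics = sorted({nic for _, nic, _ in events})
    print(f"{len(events)} set/clear events across {len(nics)} vmnic(s)")
    print(f"window: {events[0][0]} .. {events[-1][0]}")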
VMware NSX
VMware vSphere ESXi
The upstream physical switch upgrade triggered a logical reset of all physical network adapters on every ESXi host, including the adapters connected to the redundant switch.
This simultaneous reset of all uplinks caused temporary vSAN isolation across the cluster, which in turn caused the Edge VMs to drop all active BGP sessions, including those routed through the redundant TOR2 switch.
Host logs confirm that DEVICE_REENABLING flags were set and cleared for all vmnics simultaneously, even though the adapters connected to the redundant switch did not report a linkstate down event.
Please engage the physical networking and server teams to further validate the events reported during the switch upgrade window.
The events reported during that window should help clarify any failures or faults that might have caused a reset to be performed across all the redundant switch interfaces.