Users experience sudden, brief performance degradation, latency, or packet loss across the NSX overlay network.
NSX Edge Geneve tunnels transition from UP to DOWN simultaneously across multiple remote ESXi/Edge TEP endpoints:
NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="tunnel" level="INFO"] Tunnel <source_endpoint_ip>:<remote_endpoint_ip>(geneve) state updated from up to down
The Edge /var/log/syslog shows sudden BFD session teardowns with the following diagnostic messages:
NSX 12314 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="bfd" tname="dp-bfd-mon2" level="INFO"] <source_endpoint_ip>-><remote_endpoint_ip>/geneve: BFD state change: up->down "No Diagnostic"->"Neighbor Signaled Session Down".
On ESXi, /var/run/log/vmkernel.log shows diagnostic messages reporting Control Detection Time Expired:
vmkernel: cpu9:13555140)BFD_HandleStatusChange:857:[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: ###.###.###.###, remote: ###.###.###.###, oldState: up, newState: down, diag: Control Detection Time Expired, type: overlay
Edge Datapath virtual switch statistics for the overlay TEP interface show a high number of buffer drops, specifically 1stRing Full and OutOf Buffers.
Other NSX Edge nodes in the same Edge Cluster remain completely stable and maintain their BFD connections.
The vmkernel.log of the specific ESXi host where the affected Edge VM resides shows hardware memory errors 2 to 3 seconds before the BFD tunnel drops:
vmkernel: cpu8:2097563)ApeiPageRetire: 730: Processing HEST GESB, severity 0x2, with 1 GEDE record(s)
vmkernel: cpu8:2097563)ApeiPageRetire: 654: Memory error 1: val=3c3fb err=400 adr=7534d####...
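The timing relationship between the memory errors and the BFD drops can be checked with a small script. The following is a minimal sketch, not a supported tool. It assumes that each log line begins with an ISO 8601 timestamp (for example 2024-01-01T10:00:00.000Z), that the clocks of the hosts involved are synchronized, and that the two sets of lines come from the Edge's ESXi host and from a transport node that logged the BFD state change. The function names and the 5-second window are hypothetical choices for illustration.

```python
import re
from datetime import datetime

# Assumption: every line starts with an ISO 8601 timestamp, optionally with
# fractional seconds and a trailing "Z", followed by whitespace.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d{1,6})?)Z?\s")
# Patterns taken from the log excerpts quoted in this article.
MEM_ERR_RE = re.compile(r"ApeiPageRetire: \d+: Memory error")
BFD_DOWN_RE = re.compile(r"BFD_HandleStatusChange.*oldState: up, newState: down")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if absent."""
    m = TS_RE.match(line)
    if m is None:
        return None
    stamp = m.group(1)
    fmt = "%Y-%m-%dT%H:%M:%S.%f" if "." in stamp else "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(stamp, fmt)


def event_times(lines, pattern):
    """Collect timestamps of all lines matching the given pattern."""
    times = []
    for line in lines:
        if pattern.search(line):
            ts = parse_ts(line)
            if ts is not None:
                times.append(ts)
    return times


def correlate(edge_host_lines, peer_lines, window_s=5.0):
    """Pair each BFD down event with the latest memory error that
    preceded it by at most window_s seconds.

    Returns a list of (memory_error_time, bfd_down_time, delta_seconds).
    """
    mem_times = event_times(edge_host_lines, MEM_ERR_RE)
    bfd_times = event_times(peer_lines, BFD_DOWN_RE)
    pairs = []
    for bfd in bfd_times:
        preceding = [m for m in mem_times
                     if 0 <= (bfd - m).total_seconds() <= window_s]
        if preceding:
            latest = max(preceding)
            pairs.append((latest, bfd, (bfd - latest).total_seconds()))
    return pairs
```

To use it, read the two log files into lists of lines (for example with open(path).readlines()) and pass them to correlate(). BFD drops that consistently follow a memory error by a few seconds support the hardware explanation; drops with no preceding memory error point elsewhere.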
VMware NSX
This issue is caused by a hardware fault (a failing memory DIMM) on the underlying physical ESXi server hosting one of the active NSX Edge VMs.
When the physical server motherboard detects a hardware memory error, it triggers a System Management Interrupt (SMI) to log the Machine Check Exception (MCE) and instructs ESXi to safely retire the bad memory page (ApeiPageRetire). SMIs execute directly in the hardware firmware and completely bypass the OS, resulting in a temporary freeze (micro-stun) of the ESXi hypervisor and all resident virtual machines.
Because the NSX Edge VM is frozen by the hardware during this process, its DPDK datapath is starved of CPU cycles. The Edge is unable to transmit or process BFD keepalive packets. Once the BFD timeout (typically 1.5 seconds) expires, the remote transport nodes declare the session dead and tear down the Geneve overlay tunnels, causing a temporary datapath outage.
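The relationship between the length of the freeze and the BFD timeout can be sketched as follows. In BFD, the detection time is the transmit interval multiplied by the detect multiplier. The values of 500 ms and 3 used here are illustrative assumptions chosen only because they produce the 1.5-second figure mentioned above; the values in effect are whatever the environment's BFD configuration specifies. The function names are hypothetical, and the model ignores transmit jitter.

```python
def detection_time_s(tx_interval_ms, multiplier):
    """BFD detection time in seconds: transmit interval x detect multiplier."""
    return tx_interval_ms * multiplier / 1000.0


def stall_outcome(stall_s, tx_interval_ms=500, multiplier=3):
    """Classify the effect of a freeze of stall_s seconds on a BFD session.

    The peer's timer runs from the last packet it received. In the worst
    case that packet was sent almost one full interval before the freeze
    began, so the peer sees a gap of up to stall_s + one interval.
    Defaults are illustrative assumptions, not confirmed NSX defaults.
    """
    detect = detection_time_s(tx_interval_ms, multiplier)
    interval_s = tx_interval_ms / 1000.0
    if stall_s >= detect:
        return "drops"      # the freeze alone exceeds the detection time
    if stall_s + interval_s >= detect:
        return "may drop"   # depends on when the last packet was sent
    return "survives"       # even the worst-case gap stays below the limit
```

Under these assumed values, a freeze of 2 seconds or more always exceeds the detection time, which is consistent with tunnels dropping 2 to 3 seconds after the memory error is logged.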
This is not an NSX or ESXi software defect. It is a physical hardware failure on the underlying server. To resolve the issue permanently:
For more general information regarding the ESXi hypervisor's handling of these memory messages, please refer to: