NSX Edge Geneve tunnels and BFD sessions flap or drop due to ESXi host hardware memory faults (ApeiPageRetire)

Article ID: 438022

Products

VMware NSX

Issue/Introduction

 

  • Users experience sudden, brief performance degradation, latency, or packet loss across the NSX overlay network.

  • NSX Edge Geneve tunnels transition from UP to DOWN simultaneously across multiple remote ESXi/Edge TEP endpoints.
    NSX 1 FABRIC [nsx@6876 comp="nsx-edge" subcomp="nsxa" s2comp="tunnel" level="INFO"] Tunnel <source_endpoint_ip>:<remote_endpoint_ip>(geneve) state updated from up to down

  • On the Edge, /var/log/syslog shows sudden BFD session teardowns with the following diagnostic messages:
    NSX 12314 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="bfd" tname="dp-bfd-mon2" level="INFO"] <source_endpoint_ip>-><remote_endpoint_ip>/geneve: BFD state change: up->down "No Diagnostic"->"Neighbor Signaled Session Down". 

  • On the remote Transport Node (TN), /var/run/log/vmkernel.log shows BFD state changes with the diagnostic "Control Detection Time Expired":
    vmkernel: cpu9:13555140)BFD_HandleStatusChange:857:[nsx@6876 comp="nsx-esx" subcomp="bfd"]local: ###.###.###.###, remote: ###.###.###.###, oldState: up, newState: down, diag: Control Detection Time Expired, type: overlay

  • Edge Datapath virtual switch statistics for the overlay TEP interface show a high number of buffer drops, specifically 1stRing Full and OutOf Buffers.

  • Other NSX Edge nodes in the same Edge Cluster remain completely stable and maintain their BFD connections.

  • The vmkernel.log of the ESXi host where the affected Edge VM resides shows hardware memory errors 2 to 3 seconds before the BFD tunnel drops:
    vmkernel: cpu8:2097563)ApeiPageRetire: 730: Processing HEST GESB, severity 0x2, with 1 GEDE record(s)
    vmkernel: cpu8:2097563)ApeiPageRetire: 654: Memory error 1: val=3c3fb err=400 adr=7534d####...
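
The timing relationship between the ApeiPageRetire messages and the BFD drops can be checked with a short script. The following is a minimal Python sketch, not a supported tool. It assumes that every log line carries an ISO 8601 timestamp (the excerpts above have theirs stripped), that both logs use the same time zone, and that the host and Edge clocks are synchronized. The function names and the 5-second window are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumption: each line contains a timestamp such as 2024-05-01T10:15:42.123Z.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d{1,6})?")


def parse_ts(line):
    """Return the first timestamp found in the line, or None."""
    m = TS_RE.search(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    if m.group(2):
        # Convert the fractional part (".123") to microseconds.
        ts += timedelta(seconds=float("0" + m.group(2)))
    return ts


def timestamps(lines, marker):
    """Collect timestamps of all lines that contain the marker string."""
    found = []
    for line in lines:
        if marker in line:
            ts = parse_ts(line)
            if ts is not None:
                found.append(ts)
    return found


def correlate(host_lines, edge_lines, window_s=5.0):
    """Pair each BFD up->down event with the latest memory fault
    logged on the host within window_s seconds before it."""
    faults = timestamps(host_lines, "ApeiPageRetire")
    drops = timestamps(edge_lines, "BFD state change: up->down")
    window = timedelta(seconds=window_s)
    pairs = []
    for drop in drops:
        preceding = [f for f in faults if timedelta(0) <= drop - f <= window]
        if preceding:
            pairs.append((drop, max(preceding)))
    return pairs
```

Passing the lines of the host vmkernel.log and the Edge syslog to `correlate` returns (drop time, fault time) pairs. BFD drops with no preceding memory fault are omitted, which helps separate this hardware cause from unrelated tunnel flaps.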

 

Environment

VMware NSX

Cause

This issue is caused by a hardware fault (a failing memory DIMM) on the underlying physical ESXi server hosting one of the active NSX Edge VMs.

When the physical server detects a hardware memory error, it triggers a System Management Interrupt (SMI) to log the Machine Check Exception (MCE) and instructs ESXi to safely retire the bad memory page (ApeiPageRetire). SMIs are serviced by the platform firmware and bypass the operating system entirely, so the ESXi hypervisor and all resident virtual machines freeze briefly (a micro-stun) while the interrupt is handled.

Because the NSX Edge VM is frozen by the hardware during this process, its DPDK datapath is starved of CPU cycles. The Edge is unable to transmit or process BFD keepalive packets. Once the BFD timeout (typically 1.5 seconds) expires, the remote transport nodes declare the session dead and tear down the Geneve overlay tunnels, causing a temporary datapath outage.
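
The timeout arithmetic can be sketched as follows. BFD declares a session down when no control packet arrives within the detection time, which is the transmit interval multiplied by the detect multiplier. The 500 ms interval and multiplier of 3 below are assumptions chosen only to reproduce the 1.5-second figure above; check the BFD profile actually configured in your environment. The function names are hypothetical.

```python
def bfd_detection_time(tx_interval_ms, multiplier):
    """Detection time in seconds: transmit interval times detect multiplier."""
    return tx_interval_ms * multiplier / 1000.0


def session_drops(stall_s, tx_interval_ms=500, multiplier=3):
    """Return True if a stall of stall_s seconds is long enough to
    guarantee that the remote peer's detection timer expires.

    Assumed defaults: 500 ms x 3 = 1.5 s, matching the typical timeout
    cited in this article. Stalls slightly shorter than the detection
    time can also cause a drop, depending on when the last packet was
    sent, so this check is conservative.
    """
    return stall_s >= bfd_detection_time(tx_interval_ms, multiplier)


if __name__ == "__main__":
    for stall in (0.5, 1.0, 2.0, 3.0):
        print(f"stall {stall:.1f}s -> drop guaranteed: {session_drops(stall)}")
```

Under these assumed values, a freeze of 2 to 3 seconds exceeds the detection time, which is consistent with the remote transport nodes logging "Control Detection Time Expired".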

Resolution

This is not an NSX or ESXi software defect. It is a physical hardware failure on the underlying server. To resolve the issue permanently:

  1. Immediately vMotion the NSX Edge VM and any other workloads off the affected ESXi host to healthy hardware.
  2. Place the affected ESXi host into Maintenance Mode.
  3. Contact your hardware vendor to replace the faulty memory DIMM.

Additional Information

For more general information regarding the ESXi hypervisor's handling of these memory messages, please refer to: