Edge failovers caused by CPU lockups on the Edge leading to the BFD tunnels/process timing out

Article ID: 389382

Updated On:

Products

VMware NSX

Issue/Introduction

  • The Edge fails over because all GENEVE tunnels from the Edge are reported as 'DOWN'.
  • At the time of the failover, entries similar to the following are logged in the Edge /var/log/syslog, indicating that there is an issue scheduling the BFD process:
2025-01-14T21:45:44.219Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 60533
2025-01-14T21:45:44.213Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 55281
2025-01-14T21:45:49.386Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD rx enq interval exceeds maximum threshold. INTV: 57032
2025-01-14T21:45:49.387Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 65727
2025-01-14T21:45:49.387Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 65727
2025-01-14T21:47:04.681Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" tname="dp-bfd-mon4" level="WARN"] BFD module wakeup interval exceeds maximum threshold. INTV: 18673
2025-01-14T21:47:04.681Z <Edge hostname> NSX 4931 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="dp-bfd" level="INFO"] BFD tx interval exceeds maximum threshold. INTV: 18672
  • Around the time of the Edge failover, CPU soft lockup messages like the following are seen in the Edge /var/log/syslog (see the log-scanning sketch after these excerpts):
2025-01-14T21:45:45.476Z <Edge hostname> kernel - - - [9634236.013428] watchdog: BUG: soft lockup - CPU#1 stuck for 54s! [swapper/1:0]
2025-01-14T21:45:45.495Z <Edge hostname> kernel - - - [9634236.013441] watchdog: BUG: soft lockup - CPU#0 stuck for 51s! [swapper/0:0]
2025-01-14T21:45:45.496Z <Edge hostname> kernel - - - [9634236.013480] watchdog: BUG: soft lockup - CPU#2 stuck for 58s! [swapper/2:0]
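  • The following is a minimal sketch (not part of the product) that scans a saved copy of the Edge /var/log/syslog for the BFD scheduling warnings and kernel soft lockup messages shown above, so the two symptoms can be correlated by timestamp. The match patterns are assumptions taken from the excerpts in this article; adjust them to your log format.
    #!/usr/bin/env python3
    import re
    import sys

    # Assumed markers, copied from the log excerpts in this article.
    BFD_PATTERN = re.compile(r'exceeds maximum threshold\. INTV: (\d+)')
    LOCKUP_PATTERN = re.compile(r'soft lockup - CPU#(\d+) stuck for (\d+)s')
    TIMESTAMP = re.compile(r'^(\S+)')  # ISO timestamp at the start of each line

    def scan(path):
        for line in open(path, errors='replace'):
            match = TIMESTAMP.match(line)
            ts = match.group(1) if match else '?'
            bfd = BFD_PATTERN.search(line)
            if bfd:
                print(f'{ts}  BFD delay    {int(bfd.group(1)):>7} ms')
                continue
            lockup = LOCKUP_PATTERN.search(line)
            if lockup:
                print(f'{ts}  soft lockup  CPU#{lockup.group(1)} stuck for {lockup.group(2)}s')

    if __name__ == '__main__':
        scan(sys.argv[1] if len(sys.argv) > 1 else 'syslog')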

Environment

VMware NSX

Cause

  • CPU soft lockups on the Edge VM prevent processes such as BFD from being scheduled on a CPU.
  • Once the BFD timers expire, the GENEVE tunnels go down, as expected (see the illustrative sketch below).
  • The Edge triggers a failover when all GENEVE tunnels go down, which is also expected behavior.
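  • For illustration, the sketch below shows why a scheduling stall of the size logged above is more than enough to bring BFD down. The probe interval and declare-dead multiple are assumed values; check the BFD profile applied to your transport nodes for the actual ones.
    # Assumed BFD timers, for illustration only.
    probe_interval_ms = 1000       # assumed BFD probe (tx/rx) interval
    declare_dead_multiple = 3      # assumed missed probes before the session is DOWN

    detection_time_ms = probe_interval_ms * declare_dead_multiple   # 3000 ms
    observed_stall_ms = 60533      # "BFD module wakeup interval" from the log above

    print(f'BFD detection time   : {detection_time_ms} ms')
    print(f'Observed BFD stall   : {observed_stall_ms} ms')
    print(f'Session declared DOWN: {observed_stall_ms > detection_time_ms}')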

Resolution

  • A CPU soft lockup occurs when a task or kernel thread gets stuck on a CPU and does not release it for an extended period, causing the system to become unresponsive for a short time. A CPU lockup can occur for a number of reasons, such as the following:
    • CPU contention on the host. Use tools like esxtop or the Aria Operations graphs to look for signs of contention such as high CPU Ready or CPU Co-stop times. If host contention is identified, consider vMotioning the Edge to a less busy host.
    • CPU contention within the Edge VM. Use tools like top to identify sources of contention. If VM-level contention is identified, consider increasing the size of the Edge, but consult the NSX configuration maximums first.
    • Faulty hardware, such as a failing CPU or memory. Engage your hardware vendor to determine if there are any hardware issues on the host. Try vMotioning the Edge to another host to see if that resolves the issue.
    • High read or write latency to the storage backing the Edge VM. Use tools like esxtop or the Aria Operations graphs to look for signs of high read/write latency and peaks in 'CPU | Other Wait' at the time of the failover.
  • If the cause is related to high storage latency and the datastore is vSAN, you may see the following in the host /var/run/log/vmkernel.log:
    2025-01-22T07:55:19.430Z cpu89:3788044)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:19.430Z cpu89:3788044) [HB state abcdef02 offset 3801088 gen 109 stampUS 6218700352409 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    2025-01-22T07:55:20.118Z cpu90:6655841)HBX: 3063: '<UUID>': HB at offset 3801088 - Waiting for timed out HB:
    2025-01-22T07:55:20.118Z cpu90:6655841) [HB state abcdef02 offset 3801088 gen 121 stampUS 6218700352341 uuid <UUID> jrnl  drv 24.82 lockImpl 4 ip <IP ADDRESS>]
    
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [39358042:0x45bb3ef77ac0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [39358042:0x45bb3ef77ac0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2133: [39358042:0x45bb3ef77ac0] => op(0x45bafe66cb00), CSN(10322), rangemapKey(5214828), rangeOffset(3801088), rangeLen(4096), retries(0)
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2142: [39358042:0x45bb3ef77ac0] => Inclusive commit list empty
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2121: [27154287:0x45bb3efc86c0] => Stuck descriptor
    2025-01-22T07:55:27.624Z cpu16:2098891)DOM: DOM2PCPrintDescriptor:2128: [27154287:0x45bb3efc86c0] => writeWithBlkAttr5, PREPARING, ASYNC, not complete
  • A possible cause of the above vSAN log entries is external network issues or flapping pNICs on any of the hosts in the vSAN cluster. Consider opening a case with vSAN support to help identify possible causes of the latency. The vSAN case should include host logs from all hosts in the vSAN cluster, covering the time of the failover and including the performance traces, as per KB 326959. A sketch that tallies these log messages follows below.
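  • As an aid when reviewing the vmkernel.log entries above, the following is a minimal sketch (not a Broadcom tool) that tallies the heartbeat-timeout and stuck-descriptor messages per minute from a saved copy of a host's /var/run/log/vmkernel.log, to help gauge how long the storage stall lasted. The match strings are assumptions taken from the excerpts in this article.
    #!/usr/bin/env python3
    import sys
    from collections import Counter

    # Assumed markers, copied from the vmkernel.log excerpts in this article.
    MARKERS = ('Waiting for timed out HB', 'Stuck descriptor')

    def tally(path):
        per_minute = Counter()
        for line in open(path, errors='replace'):
            if any(marker in line for marker in MARKERS):
                # Bucket by the minute portion of the leading timestamp,
                # e.g. "2025-01-22T07:55".
                per_minute[line[:16]] += 1
        for minute, count in sorted(per_minute.items()):
            print(f'{minute}  {count} storage stall message(s)')

    if __name__ == '__main__':
        tally(sys.argv[1] if len(sys.argv) > 1 else 'vmkernel.log')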