BFD session goes down when VTEP group members are updated

Article ID: 408887

Products

VMware NSX

Issue/Introduction

  • The BFD session is down on some hosts.
  • On the affected transport nodes, the following log shows VTEP group member updates that are in progress or occurred recently:
     /var/run/log/nsx-syslog.log

    2025-08-12T18:59:23.931Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received ROUTING_DOMAIN_FIB OP msg: id {   self {     op: CLEAR   } } vtep {   self {     op: CLEAR   } } vtep_group_label {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 367616
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 262145
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 332800
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 343040
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 346112
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 284672
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 378880
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 359424
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }
    2025-08-12T18:59:23.935Z In(182) cfgAgent[2102967]: NSX 2102967 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="15385940" level="info"] Decoder: Received L2_VTEP_GROUP msg: id: 301056
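
To gauge how many VTEP group updates a node received, the log lines can be tallied by group id and operation. The following is a minimal sketch, not a supported tool: the function name summarize_vtep_group_updates and the regular expressions are assumptions derived only from the sample lines above, and other NSX builds may log these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this article (assumed format).
GROUP_ID = re.compile(r"Received L2_VTEP_GROUP msg: id: (\d+)")
GROUP_OP = re.compile(r"Received L2_VTEP_GROUP OP msg \(Operation (\w+)\)")


def summarize_vtep_group_updates(lines):
    """Return (Counter of group ids, Counter of operations) seen in the lines."""
    ids, ops = Counter(), Counter()
    for line in lines:
        match = GROUP_ID.search(line)
        if match:
            ids[int(match.group(1))] += 1
            continue
        match = GROUP_OP.search(line)
        if match:
            ops[match.group(1)] += 1
    return ids, ops


# Two abbreviated lines taken from the log excerpt above.
sample = [
    'cfgAgent[2102967]: Decoder: Received L2_VTEP_GROUP msg: id: 367616',
    'cfgAgent[2102967]: Decoder: Received L2_VTEP_GROUP OP msg (Operation CLEAR): self {   op: CLEAR }',
]
ids, ops = summarize_vtep_group_updates(sample)
print(dict(ids), dict(ops))
```

To run it against a real log, pass an open file handle for a copy of /var/run/log/nsx-syslog.log instead of the sample list.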
    

Environment

VMware NSX 4.x (versions earlier than 4.2.3)

Cause

  • When a BFD VTEP group session is created, a non-VTEP group session created at the same time is mistakenly marked as a VTEP group session.
  • As a result, when an updated VTEP group message arrives, this non-VTEP group session is deleted, causing the peer-side session to go down.

Example:

  • Edge1 has VTEPs 1.1.1.2 and 1.1.1.3, which form TEP Group 100.
  • ESX1 has VTEP 1.1.1.4, and ESX2 has VTEP 1.1.1.11.
  • The VtepGroupMsg in nestdb is: { "id": 100, "vteps": [1.1.1.2, 1.1.1.3] }

  • For ESX2, these tunnels should be created:
    localIP      remoteIP  tnVgLabel
    1.1.1.11 <-> 1.1.1.2   100
    1.1.1.11 <-> 1.1.1.3   100
    1.1.1.11 <-> 1.1.1.4   N/A
  • Instead, the 1.1.1.11 <-> 1.1.1.4 tunnel was also created with tnVgLabel 100:
    localIP      remoteIP  tnVgLabel
    1.1.1.11 <-> 1.1.1.2   100
    1.1.1.11 <-> 1.1.1.3   100
    1.1.1.11 <-> 1.1.1.4   100
  • ESX2 then receives a new VtepGroupMsg for TEP Group 100, { "id": 100, "vteps": [1.1.1.3, 1.1.1.2] }, and must determine which VTEPs were removed from or added to TEP Group 100.
  • A lookup is performed on the tunnel table above to find the existing tunnels that belong to TEP Group 100.
  • In the incorrect case, the lookup returns 1.1.1.2, 1.1.1.3, and 1.1.1.4.
    • Because 1.1.1.4 is not in the newly received VtepGroupMsg for group 100, its tunnel is treated as stale and deleted, leaving ESX2 with only:
      localIP      remoteIP  tnVgLabel
      1.1.1.11 <-> 1.1.1.2   100
      1.1.1.11 <-> 1.1.1.3   100
  • As a result, ESX1 sees the tunnel between 1.1.1.4 and 1.1.1.11 go down, because ESX2 no longer sends BFD packets for 1.1.1.4.
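
The lookup-and-delete step in the example can be modeled in a few lines. This is an illustrative sketch only, not the NSX implementation: the Tunnel class and apply_vtep_group_update function are hypothetical names, and the sketch models only the removal of tunnels, not the addition of new group members.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tunnel:
    local_ip: str
    remote_ip: str
    tn_vg_label: Optional[int]  # None stands for "N/A" (not in a VTEP group)


def apply_vtep_group_update(tunnels: List[Tunnel], group_id: int,
                            vteps: List[str]) -> List[Tunnel]:
    """Keep every tunnel except those labeled group_id whose remote VTEP
    is missing from the new member list (hypothetical model of the diff)."""
    members = set(vteps)
    return [t for t in tunnels
            if t.tn_vg_label != group_id or t.remote_ip in members]


# Tunnel table on ESX2 as it should be (1.1.1.4 carries no group label).
correct = [
    Tunnel("1.1.1.11", "1.1.1.2", 100),
    Tunnel("1.1.1.11", "1.1.1.3", 100),
    Tunnel("1.1.1.11", "1.1.1.4", None),
]
# Tunnel table with the defect (1.1.1.4 wrongly labeled 100).
mislabeled = [
    Tunnel("1.1.1.11", "1.1.1.2", 100),
    Tunnel("1.1.1.11", "1.1.1.3", 100),
    Tunnel("1.1.1.11", "1.1.1.4", 100),
]

new_members = ["1.1.1.3", "1.1.1.2"]  # same members, different order

kept_correct = [t.remote_ip for t in apply_vtep_group_update(correct, 100, new_members)]
kept_mislabeled = [t.remote_ip for t in apply_vtep_group_update(mislabeled, 100, new_members)]

print(kept_correct)     # ['1.1.1.2', '1.1.1.3', '1.1.1.4']
print(kept_mislabeled)  # ['1.1.1.2', '1.1.1.3']
```

With the correct labels the update is a no-op, since the member list only changed order. With the mislabeled entry, the same update removes the 1.1.1.4 tunnel even though ESX1 never belonged to the group.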
 

Resolution

Workaround:

  • Restart nsx-cfgagent on affected ESX transport nodes.
    • /etc/init.d/nsx-cfgagent stop
    • /etc/init.d/nsx-cfgagent start

Resolution:

  • Upgrade to NSX 4.2.3 or later.