Extended communication disruption (several minutes or more) via L2VPN following an Edge failover

Article ID: 430672

Updated On:

Products

VMware NSX

Issue/Introduction

  • An Edge failover occurred around [Timestamp#1], triggered by an event such as an Edge being placed into maintenance mode or a system failure.
  • Following the failover, communication between segments extended via L2VPN was impacted until approximately [Timestamp#2].
  • When a Local Site Edge failover disrupts communication between a machine at the Remote Site (MAC address <Remote_Machine_MAC>) and machines at the Local Site, logs similar to the following may be observed:

    ## Remote Site
    $ grep <Remote_Machine_MAC> Remote_Active_Edge/var/log/syslog
    [Timestamp#1] [Remote_Active_Edge] NSX 5050 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="mac-sync" tname="dp-learning3" level="INFO"] mac sync entry (MAC:<Remote_Machine_MAC>, vni:0(0x0), bridge-port UUID:<bridge-port UUID#1>, vlan_id:<vlan_id#1>, type 0) created
    [Timestamp#1] [Remote_Active_Edge] NSX 5050 SWITCHING [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="lswitch" tname="dp-learning3" level="INFO"] Update the dynamic FDB entry for <0, <vlan_id#1>, <Remote_Machine_MAC>> by a RARP from ifuid 0 to <ifuid#1>
    [Timestamp#2] [Remote_Active_Edge] NSX 5050 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="mac-sync" tname="dp-learning3" level="INFO"] mac sync entry (MAC:<Remote_Machine_MAC>, vni:0(0x0), bridge-port UUID:<bridge-port UUID#1>, vlan_id:<vlan_id#1>) deleted
    $ less Remote_Active_Edge/edge/nsx-agent-state
    ...
        "lswitch/show": [
    ...
            {
                "name": "<name#1>",
                "op_state_up": true,
                "ports": [
                    {
                        "op_state_up": true,
                        "ifuid": <ifuid#1>,
                        "admin_up": true,
                        "lswitch": "<lswitch#1>",
                        "uuid": "<uuid#1>",
                        "peer": "<bridge-port UUID#1>",
                        "op_state": 1,
                        "op_state_mask": 1
                    }
                ],
                "admin_up": true,
                "device-admin-state": "Up",
                "vlan": <vlan_id#1>,
                "uuid": "<uuid#1>",
                "transport_zone_id": "<transport_zone_id#1>",
                "flags": 0,
                "is_punt_port_switch": false,
                "ha_op_up": false,
                "device-state": "Up",
                "device": "fp-eth0"
            },
    ...


    ## Local Site

    $ grep <Remote_Machine_MAC> <Local_Active_Edge>/var/log/syslog
    [Timestamp#2] <Local_Active_Edge> NSX 2813246 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" s2comp="mac-sync" tname="dp-learning4" level="INFO"] mac sync entry (MAC:<Remote_Machine_MAC>, vni:0(0x0), bridge-port UUID:<bridge-port UUID#2>, vlan_id:<vlan_id#2>, type 0) created
    $ less Local_Active_Edge/edge/nsx-agent-state
    ...
        "lswitch/show": [
    ...
            {
                "name": "<name#2>",
                "op_state_up": true,
                "ports": [
                    {
                        "op_state_up": true,
                        "ifuid": <ifuid#2>,
                        "admin_up": true,
                        "lswitch": "<lswitch#2>",
                        "uuid": "<uuid#2>",
                        "peer": "<bridge-port UUID#2>",
                        "op_state": 1,
                        "op_state_mask": 1
                    }
                ],
                "admin_up": true,
                "device-admin-state": "Up",
                "vlan": <vlan_id#2>,
                "uuid": "<uuid#2>",
                "transport_zone_id": "<transport_zone_id#2>",
                "flags": 0,
                "is_punt_port_switch": false,
                "ha_op_up": false,
                "device-state": "Up",
                "device": "fp-eth0"
            },
    ...
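
To estimate how long the stale entry existed, the "created" and "deleted" mac-sync lines for one MAC address can be paired and their timestamps subtracted. The following is a minimal sketch, not an official tool. It assumes each syslog line starts with an ISO 8601 timestamp. The function names, the sample MAC address, the bridge-port UUID, the VLAN ID, and the hostname are all hypothetical values modeled on the log format shown above.

```python
import re
from datetime import datetime

# Modeled on the "mac sync entry (...) created/deleted" lines shown above.
# Assumption: the first token of each line is an ISO 8601 timestamp.
MAC_SYNC_RE = re.compile(
    r'^(?P<ts>\S+)\s.*s2comp="mac-sync".*'
    r'mac sync entry \(MAC:(?P<mac>[0-9A-Fa-f:]{17}),.*?'
    r'bridge-port UUID:(?P<bp>[^,)]+),\s*vlan_id:(?P<vlan>\d+)[^)]*\)\s+'
    r'(?P<action>created|deleted)\s*$'
)


def mac_sync_events(lines, mac):
    """Return sorted (timestamp, action, bridge_port, vlan) tuples for one MAC."""
    events = []
    for line in lines:
        m = MAC_SYNC_RE.match(line)
        if m and m.group("mac").lower() == mac.lower():
            ts = datetime.fromisoformat(m.group("ts").replace("Z", "+00:00"))
            events.append((ts, m.group("action"), m.group("bp"), int(m.group("vlan"))))
    return sorted(events)


def stale_window_seconds(events):
    """Seconds from the first 'created' to the next 'deleted' on the same bridge port."""
    created = None
    for ts, action, bp, _vlan in events:
        if action == "created" and created is None:
            created = (ts, bp)
        elif action == "deleted" and created and created[1] == bp:
            return (ts - created[0]).total_seconds()
    return None


# Hypothetical sample lines (all values are placeholders, not real log data).
_HDR = ('NSX 5050 FABRIC [nsx@6876 comp="nsx-edge" subcomp="datapathd" '
        's2comp="mac-sync" tname="dp-learning3" level="INFO"] ')
_BP = "11111111-2222-3333-4444-555555555555"
SAMPLE = [
    f"2024-01-01T10:00:00.000Z edge-r1 {_HDR}mac sync entry (MAC:00:50:56:aa:bb:cc, "
    f"vni:0(0x0), bridge-port UUID:{_BP}, vlan_id:100, type 0) created",
    f"2024-01-01T10:05:00.000Z edge-r1 {_HDR}mac sync entry (MAC:00:50:56:aa:bb:cc, "
    f"vni:0(0x0), bridge-port UUID:{_BP}, vlan_id:100) deleted",
]

print(stale_window_seconds(mac_sync_events(SAMPLE, "00:50:56:AA:BB:CC")))  # 300.0
```

On the Remote Site Edge, a window that starts near [Timestamp#1] and ends near [Timestamp#2] would be consistent with the disruption period described above.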

Environment

VMware NSX

Cause

  • When an Edge failover occurs in an environment where L2VPN services are configured, RARP packets are sent for machines connected to segments that are extended to the peer site. If these RARP packets are forwarded to the peer site over the L2VPN session because of the behaviors described in KB#323331, KB#322645, KB#329011, or KB#379904, the peer site learns MAC addresses abnormally and communication is disrupted. In the example above, a RARP packet sent from the Local Site was forwarded to the Remote Site, causing the MAC address to be incorrectly learned on the bridge port. The issue typically resolves on its own once the machine-side devices relearn the MAC address (for example, when aging timers expire).

  • Through the MAC Sync mechanism, the active Edge node in an NSX Edge cluster continuously synchronizes its learned L2 MAC table with the standby Edge node, which prevents communication disruption when a failover occurs. Remote MAC addresses learned via L2VPN are synchronized within the Edge cluster. Under normal operation, in the example above, the MAC Sync entry for <Remote_Machine_MAC> can be confirmed on both Edges of the HA pair at the Local Site:
    root@<Local_Edge>:~# edge-appctl -t /var/run/vmware/edge/dpd.ctl mac-sync/show-table | python3 -m json.tool
    ...
        {
            "mac": "<Remote_Machine_MAC>",
            "vni": 0,
            "type": "Local",
            "vlan_id": <vlan_id#2>,
            "bridge-port-uuid": "<bridge-port UUID#2>"
        },
    ...
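
The same check can be scripted against the JSON printed by the command above. The following is a minimal sketch. Because the excerpt elides the top-level layout of the output, the code makes no assumption about it and simply walks the decoded JSON looking for objects with a matching "mac" field. The function name and the sample values are hypothetical.

```python
import json


def find_mac_sync_entries(table_json, mac):
    """Return every JSON object in the output whose "mac" field equals `mac`."""
    wanted = mac.lower()
    found = []
    stack = [json.loads(table_json)]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            if str(node.get("mac", "")).lower() == wanted:
                found.append(node)
            stack.extend(node.values())  # descend into nested containers
        elif isinstance(node, list):
            stack.extend(node)
    return found


# Hypothetical sample using the field names shown in the excerpt above.
SAMPLE_TABLE = json.dumps([
    {"mac": "00:50:56:aa:bb:cc", "vni": 0, "type": "Local", "vlan_id": 100,
     "bridge-port-uuid": "11111111-2222-3333-4444-555555555555"},
    {"mac": "00:50:56:dd:ee:ff", "vni": 0, "type": "Local", "vlan_id": 100,
     "bridge-port-uuid": "11111111-2222-3333-4444-555555555555"},
])

for entry in find_mac_sync_entries(SAMPLE_TABLE, "00:50:56:AA:BB:CC"):
    print(entry["mac"], entry["vlan_id"], entry["bridge-port-uuid"])
```

Running the check on both the active and the standby Edge and comparing the results shows whether the entry was synchronized to both nodes.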

Resolution

  • Verify your environment's configuration to determine if it is affected by the symptoms described in KB#323331, KB#322645, KB#329011, or KB#379904.
  • If the environment does not match any of these known cases of RARP reflection, capture packets while reproducing the issue and investigate the captures.
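
When reviewing such captures, one starting point is to pick out the RARP frames and note their source MAC addresses. The following is a minimal sketch that classifies a single raw Ethernet frame. It relies only on the standard EtherType values for RARP (0x8035) and 802.1Q VLAN tags (0x8100). How frames are extracted from the capture file is outside its scope, and the function name and sample addresses are hypothetical.

```python
import struct

ETHERTYPE_RARP = 0x8035  # Reverse ARP
ETHERTYPE_VLAN = 0x8100  # 802.1Q VLAN tag


def rarp_source_mac(frame: bytes):
    """Return the source MAC of an Ethernet frame if it carries RARP, else None."""
    if len(frame) < 14:
        return None
    src = frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    offset = 14
    # Skip any 802.1Q tags: each adds 2 bytes of TCI plus the next EtherType.
    while ethertype == ETHERTYPE_VLAN and len(frame) >= offset + 4:
        (ethertype,) = struct.unpack("!H", frame[offset + 2:offset + 4])
        offset += 4
    if ethertype != ETHERTYPE_RARP:
        return None
    return ":".join(f"{b:02x}" for b in src)


# Hypothetical frames: broadcast destination, placeholder source MAC.
DST = bytes.fromhex("ffffffffffff")
SRC = bytes.fromhex("005056aabbcc")
untagged_rarp = DST + SRC + struct.pack("!H", ETHERTYPE_RARP) + bytes(28)
tagged_rarp = (DST + SRC + struct.pack("!HH", ETHERTYPE_VLAN, 100)
               + struct.pack("!H", ETHERTYPE_RARP) + bytes(28))
plain_arp = DST + SRC + struct.pack("!H", 0x0806) + bytes(28)

print(rarp_source_mac(untagged_rarp))
print(rarp_source_mac(tagged_rarp))
print(rarp_source_mac(plain_arp))
```

A RARP frame whose source MAC belongs to a machine at the other site, seen on the L2VPN path shortly after the failover, would match the reflection behavior described in the Cause section.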

Additional Information

KB#323331 : Intermittent packet loss may occur when bridging is configured on NSX or using HCX Network Extension

KB#322645 : L2 loop causes invalid MAC learning on NSX-T Edge Node when NPAR TX switching and L2VPN is enabled

KB#329011 : Duplicate Multicast or Broadcast Packets are Received by a Virtual Machine When the Interface is Operating in Promiscuous Mode

KB#379904 : Advanced networking settings like ReversePathFwdCheckPromisc and IGMPVersion do not work after upgrading VDS or uninstalling NSX.