NSX-T VDR cannot resolve SR backplane MAC address when both the workload VM and the Edge VM are on the same host.

Article ID: 376796


Products

VMware NSX Networking, VMware NSX-T Data Center, VMware NSX

Issue/Introduction

In NSX-T environments, the Virtual Distributed Router (VDR) may fail to resolve the Service Router (SR) backplane MAC address if both the workload VM and the Edge VM are on the same host. This issue arises from having two different values for the same configuration key (com.vmware.port.extraConfig.vdl2.nestedTNConfig) in the logical port configuration.

  • com.vmware.port.extraConfig.vdl2.nestedTNConfig is an attribute that NSX-T sets internally on the logical port connected to an Edge node vNIC.
  • In 3.1.x releases, com.vmware.port.extraConfig.vdl2.nestedTNConfig is populated in the logical port extra_config attribute.
  • In 3.2.0 and later releases, com.vmware.port.extraConfig.vdl2.nestedTNConfig is populated in the logical port system_extra_config attribute, which is not exposed to the user. However, these releases never remove the existing com.vmware.port.extraConfig.vdl2.nestedTNConfig entry from extra_config. As a result, after an upgrade from 3.1.x to 3.2.0 or later, a given logical port has two com.vmware.port.extraConfig.vdl2.nestedTNConfig entries: one in extra_config and one in system_extra_config.
  • Both com.vmware.port.extraConfig.vdl2.nestedTNConfig entries are set in the LogSwitchPortConfigMsg → extra_config sent by the Management Plane (MP) to the host node on which the Edge resides. This leads to undefined behaviour in the host configuration agent (cfgAgent).

 

  • Symptoms:

    • Traffic disruption is seen in a collapsed-cluster environment where the Edge VM and the workload VM are on the same host. When the Edge and the host node use the same TEP VLAN and duplicate com.vmware.port.extraConfig.vdl2.nestedTNConfig entries are present, the tunnels between the Edge and the host node go down.
    • Use the attached findAllImpactedLogicalPortsGroupedByEdgeNodes.py script to list the impacted segment ports/logical ports (those with duplicate com.vmware.port.extraConfig.vdl2.nestedTNConfig entries), grouped by Edge node.
    • Run findAllImpactedLogicalPortsGroupedByEdgeNodes.py on an NSX Manager node. The script generates two output files: impacted_ports_grouped_by_edge.json, which shows the impacted logical ports/segment ports grouped by Edge node, and impacted_edge_uuids.txt, which lists the impacted Edge node UUIDs.
    • For these impacted segment ports/logical ports, the LogSwitchPortConfigMsg entry on the host node (on which the affected Edge resides) shows duplicate com.vmware.port.extraConfig.vdl2.nestedTNConfig extra_config entries.

Example Object Type: vmware.nsx.nestdb.LogSwitchPortConfigMsg

[root@xx-xx-xx-xx-xx-xx:~] /opt/vmware/nsx-nestdb/bin/nestdb-cli --beautify --cmd get vmware.nsx.nestdb.LogSwitchPortConfigMsg  | grep nested

{'id': {'left': ###########, 'right': ###########},
 'log_switch_id': {'left': ###########, 'right': ###########},
 'attachment': {'vif_attachment': {'vif_id': '######################', 'type': 'INDEPENDENT'}},

 

          'extra_config': [{'key': 'com.vmware.port.extraConfig.vdl2.nestedTNConfig',
                   'value': 'version=1;vlan=###,label=XXXXX;vlan=###,label=XXXXX'},
                  {'key': 'com.vmware.port.extraConfig.vdl2.nestedTNConfig',
                   'value': 'version=1;vlan=###,label=YYYYY;vlan=###,label=YYYYY'}]}
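The duplicate check on such a message can be sketched in Python. This is a minimal illustration, not part of the attached scripts: it assumes the message has already been loaded into a dictionary shaped like the nestdb-cli output above, and the function name find_duplicate_keys and the sample VLAN/label values are hypothetical.

```python
from collections import Counter

NESTED_TN_KEY = "com.vmware.port.extraConfig.vdl2.nestedTNConfig"


def find_duplicate_keys(port_msg):
    """Return the extra_config keys that appear more than once in port_msg."""
    counts = Counter(entry["key"] for entry in port_msg.get("extra_config", []))
    return sorted(key for key, count in counts.items() if count > 1)


# Hypothetical sample shaped like the output above (values are placeholders).
sample_msg = {
    "extra_config": [
        {"key": NESTED_TN_KEY, "value": "version=1;vlan=10,label=1111"},
        {"key": NESTED_TN_KEY, "value": "version=1;vlan=10,label=2222"},
    ]
}

duplicates = find_duplicate_keys(sample_msg)
print(NESTED_TN_KEY in duplicates)
```

A port message whose extra_config list carries the key only once (or not at all) yields an empty list.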

 

 

Observations:

  • The logical port contains the key com.vmware.port.extraConfig.vdl2.nestedTNConfig with differing values in extra_config and system_extra_config.
  • On the host, the key com.vmware.port.extraConfig.vdl2.nestedTNConfig may hold an invalid value. Check the port properties with a command such as net-dvs -l:
    •  port ######-####-####[PortID]:
          com.vmware.common.port.alias =  ######-####-####[PortID] ,   propType = CONFIG 
          com.vmware.common.port.connectid = ######### ,   propType = CONFIG 
          com.vmware.common.port.backingType = nsx ,  propType = CONFIG   
          com.vmware.port.extraConfig.vdl2.nestedTNConfig = version=1;vlan=##,label=111617 ,  propType = POLICY 

  • ARP resolution by VDR fails.
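To compare the values reported for com.vmware.port.extraConfig.vdl2.nestedTNConfig, it can help to split them into fields. The Python sketch below assumes the layout seen in the examples above (semicolon-separated segments, a leading version segment, then comma-separated vlan and label pairs); that layout is inferred from the samples and is not a documented format, and the function name parse_nested_tn_config and the VLAN numbers are hypothetical.

```python
def parse_nested_tn_config(value):
    """Split a nestedTNConfig string into its version and (vlan, label) entries."""
    version = None
    entries = []
    for segment in value.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # Each segment is a comma-separated list of name=value pairs.
        fields = dict(pair.strip().split("=", 1) for pair in segment.split(","))
        if "version" in fields:
            version = fields["version"]
        else:
            entries.append((fields.get("vlan"), fields.get("label")))
    return version, entries


# Hypothetical values; the VLAN numbers are placeholders.
first = parse_nested_tn_config("version=1;vlan=10,label=111617")
second = parse_nested_tn_config("version=1;vlan=10,label=222222")
print(first)
print(first != second)
```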

 

Relevant Logs

  • NSX API Log: /var/log/proton/nsxapi.log, desired_state_manager.json
  • Host Logs: net-dvs_l.txt
  • CCP Dumps: data_dump and adaptor_ufo_dump

Environment

Impacted Version:

  • 3.2.0 and above.
  • The issue is seen only in NSX-T environments that were upgraded from 3.1.x to 3.2.0 or later.

Cause

The logical port configuration holds two different values for the same configuration key (com.vmware.port.extraConfig.vdl2.nestedTNConfig). Because of this discrepancy, network traffic fails when workload VMs and the Edge VM are on the same host.

Resolution

This issue will be resolved in a later version of NSX. 

Workaround for 4.1 and above releases 

As a workaround, run the attached fixNestedTNConfigForAllLPsOnEdgeNode_version_4_1_and_above.py script. It is preferable to schedule a maintenance window before applying the workaround.

Steps to run the script:

  • Download the attached fixNestedTNConfigForAllLPsOnEdgeNode_version_4_1_and_above.py Python script and store it on an NSX Manager node.
  • The script needs a file that lists the impacted Edge UUIDs. Review and use the impacted_edge_uuids.txt file generated by the findAllImpactedLogicalPortsGroupedByEdgeNodes.py script.
  • Run the command:

    python3 fixNestedTNConfigForAllLPsOnEdgeNode_version_4_1_and_above.py <edge_uuid_file>

            <edge_uuid_file> → The file that contains the list of impacted Edge UUIDs. Each Edge node UUID must be on a separate line in this file.
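Before passing the file to the script, it may be worth confirming that each line holds exactly one well-formed UUID. The Python sketch below is an illustration only and is not part of the attached scripts; it assumes Edge node UUIDs use the standard UUID text format, and the function name parse_edge_uuids and the sample UUID are hypothetical.

```python
import uuid


def parse_edge_uuids(text):
    """Return (valid, invalid) lists from text holding one Edge UUID per line."""
    valid, invalid = [], []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        try:
            uuid.UUID(candidate)  # raises ValueError on malformed input
            valid.append(candidate)
        except ValueError:
            invalid.append(candidate)
    return valid, invalid


# Hypothetical file contents; the UUID is a placeholder.
sample_text = "11111111-2222-3333-4444-555555555555\n\nnot-a-uuid\n"
valid, invalid = parse_edge_uuids(sample_text)
print(len(valid), len(invalid))
```

To check a real file, read it first, for example with open("impacted_edge_uuids.txt") as handle: parse_edge_uuids(handle.read()).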

 

The script performs the following steps to fix the issue:

  • The script accepts, as an argument, a file that lists the UUIDs of the impacted Edge nodes. The impacted_edge_uuids.txt file generated by the findAllImpactedLogicalPortsGroupedByEdgeNodes.py script can be used once the list of impacted Edges in it has been reviewed.
  • For each Edge node in the list, the script uses API calls to remove the duplicate "com.vmware.port.extraConfig.vdl2.nestedTNConfig" "extra_config" entry from all impacted segment/logical ports (those attached to an Edge vNIC).
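The clean-up idea can be sketched in Python on an in-memory port object. This does not reproduce the attached script and makes no API calls: it assumes a port represented as a dictionary with a flat extra_config list of key/value entries, which may differ from the real API payload, and the function name strip_nested_tn_config, the example.other.key entry and the sample values are hypothetical.

```python
import copy

NESTED_TN_KEY = "com.vmware.port.extraConfig.vdl2.nestedTNConfig"


def strip_nested_tn_config(port):
    """Return a copy of port without the nestedTNConfig entry in extra_config."""
    cleaned = copy.deepcopy(port)  # leave the caller's object untouched
    cleaned["extra_config"] = [
        entry
        for entry in cleaned.get("extra_config", [])
        if entry.get("key") != NESTED_TN_KEY
    ]
    return cleaned


# Hypothetical port: the key is present in both attributes after an upgrade.
port = {
    "extra_config": [
        {"key": NESTED_TN_KEY, "value": "version=1;vlan=10,label=1111"},
        {"key": "example.other.key", "value": "kept"},
    ],
    "system_extra_config": [
        {"key": NESTED_TN_KEY, "value": "version=1;vlan=10,label=2222"},
    ],
}

fixed = strip_nested_tn_config(port)
print([entry["key"] for entry in fixed["extra_config"]])
```

Only the extra_config copy of the key is dropped; the system_extra_config entry and any unrelated extra_config entries are left as they were.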

 

Additional Information

If you suspect you are experiencing this issue and need assistance with validation, please open a support case with Broadcom.

Attachments

findAllImpactedLogicalPortsGroupedByEdgeNodes.py
fixNestedTNConfigForAllLPsOnEdgeNode_version_4_1_and_above.py