Bare Metal Edge server has 2 port network card with one port/interface fp-ethX being in DOWN state and not coming up - LACP is configured for these interfaces



Article ID: 409258


Products

VMware NSX

Issue/Introduction

- The Bare Metal Edge is healthy, but one interface (fp-ethX) has a link status of DOWN.

- The admin state of this interface is UP, meaning the physical card is correctly detected.

- The configuration of the Edge uplink profile looks good (LACP/LAG is set up).

- The other end of the link is a Cisco leaf switch connected to this Bare Metal Edge server.

Environment

VMware NSX

Cause

- The admin state is UP on this interface, which indicates that the card itself is good, so the cause of the link being down lies elsewhere.

Common causes and troubleshooting steps

1. Configuration mismatch
The most frequent cause of LACP failures is a misconfiguration between the NSX bare metal edge and the physical top-of-rack (ToR) switch.
LACP mode: Verify that the LACP modes on both sides are compatible. At least one side must be "Active": active/active and active/passive both negotiate, but passive/passive does not, because neither end initiates. A switch port-channel set to static mode ("on") does not run LACP at all and will not negotiate with an edge configured for LACP.
LAG settings: Confirm that the LAG identifier, LACP system ID, and other parameters are correctly set and consistent. In a Multi-chassis Link Aggregation Group (MLAG) environment, verify that the peer switches present the same LACP system ID to the edge and that the LAG configuration is identical across the peer switches.
Port-channel status: Check that the port-channel interface on the physical switch is in an "administratively up" state. 
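
The mode check can be written down as a small truth table. The sketch below assumes standard LACP behaviour (an active end initiates, a passive end only responds, a static "on" end does not speak LACP); the names `LagMode` and `lacp_will_negotiate` are hypothetical and not part of NSX or any switch CLI.

```python
from enum import Enum


class LagMode(Enum):
    ACTIVE = "active"    # sends LACP PDUs on its own initiative
    PASSIVE = "passive"  # only answers LACP PDUs it receives
    ON = "on"            # static bundle, does not run LACP


def lacp_will_negotiate(edge: LagMode, switch: LagMode) -> bool:
    """Return True if the two ends can form an LACP-negotiated bundle."""
    if LagMode.ON in (edge, switch):
        # A static end never exchanges LACP PDUs, so nothing is negotiated.
        return False
    # Both ends speak LACP: at least one of them has to initiate.
    return LagMode.ACTIVE in (edge, switch)


if __name__ == "__main__":
    for edge in LagMode:
        for switch in LagMode:
            print(edge.value, switch.value, lacp_will_negotiate(edge, switch))
```

Running the script prints the result for all nine mode combinations, which can be compared against the modes actually configured in the uplink profile and on the switch port-channel.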

2. Physical layer issues
Problems with the physical infrastructure can prevent LACP from negotiating successfully.
Cable connectivity: Check the network cables for proper seating and physical damage. Test the physical links by swapping cables and optics, if applicable.
Speed and duplex settings: Ensure that the speed and duplex settings are consistent on both the bare metal edge and the physical switch.
Fiber link issues: If fiber links are used, check for a one-way connection caused by faulty optics or cables. 

3. Bare metal edge issues
Problems on the NSX bare metal edge itself can also be the cause.
Check datapath interfaces: On the bare metal edge, use the ovs-appctl bond/show and ovs-appctl lacp/show commands to inspect the status of the bond and its members.
Restart dataplane service: A previous configuration change may not have correctly reset the interface members. In this case, restarting the dataplane service with the restart service dataplane command can resolve the issue.
Upgrade failures: If the LACP failure occurred after an NSX-T upgrade, the transport node may need to be redeployed. Refer to Broadcom support articles related to upgrade failures.
NUMA node configuration: In specific hardware scenarios, incorrect Non-Uniform Memory Access (NUMA) node configuration can prevent proper load balancing and lead to LACP failures. The datapath cores backing the bond should be on the same NUMA node. This can often be fixed by disabling the sub-NUMA functionality in the server's BIOS. 

- The switch connected to the Bare Metal Edge (BME) uplink interfaces also has LACP configured, but its logs indicate that the port is in the err-disabled state.

Common reasons for LACP err-disable on the Cisco switch side:

1. EtherChannel Misconfiguration:
The most frequent cause is a mismatch in configuration between the devices on either end of the EtherChannel. This can include:
Incorrect LACP mode (active/passive vs. on). 
Incompatible EtherChannel settings between the ports. 
An issue with the physical cables or SFP modules. 

2. LACP PDU Issues:
Unexpected LACP PDU exchanges from the partner device can trigger an error, especially during the initial setup of the port channel. 
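The show etherchannel summary command shows the state of the bundle and of each member port, and the show lacp counters command shows the LACP PDUs sent and received per port. Compare these with the LACP status on the edge side of the link.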
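
One way to read LACP PDU counters is to take two readings of the same port more than 30 seconds apart (at the LACP slow rate, a PDU is sent every 30 seconds) and compare how much the sent and received counts grew. The helper below is an illustrative sketch only; the function name and the returned wording are hypothetical and not part of any Cisco or NSX tool.

```python
def diagnose_lacp_pdus(sent_delta: int, received_delta: int) -> str:
    """Classify a port from the growth of its LACP PDU counters.

    Both arguments are the difference between two counter readings
    taken on the same port, read manually from the device.
    """
    if sent_delta < 0 or received_delta < 0:
        raise ValueError("counter deltas must not be negative")
    if sent_delta > 0 and received_delta > 0:
        return "PDUs flow both ways: compare mode, key and system ID settings"
    if sent_delta > 0:
        return "sending only: partner is silent or the link is one-way"
    if received_delta > 0:
        return "receiving only: local port is not transmitting LACP"
    return "no LACP traffic: check link state and that LACP is enabled on both ends"


if __name__ == "__main__":
    # Example: the port sent 3 PDUs between readings but received none.
    print(diagnose_lacp_pdus(sent_delta=3, received_delta=0))
```

A "sending only" result on both the edge and the switch at the same time is consistent with a one-way fiber or optics fault, which ties back to the physical layer checks.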

Resolution

To resolve the issue, follow this systematic approach:

Check NSX Manager UI: Navigate to System > Fabric > Nodes > Edge Transport Nodes and find the bare metal edge. Check the status column for any issues.

Use edge commands: Log in to the bare metal edge and use the ovs-appctl commands to check the bond and LACP status for the fp-ethX interface.

Inspect physical switches: Examine the switch port-channel configuration and LACP counters to ensure that the switch is configured to match the NSX edge and is sending and receiving LACP PDUs correctly. Resolve any ports that are in the err-disabled state on the switch.

Restart services: If the above steps show a mismatch or a stale state, restart the dataplane service on the edge to re-establish the bond.

Engage support: If the issue persists, collect logs from both the NSX edge and the physical switches, and contact Broadcom support for further assistance. 
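
The sequence above can be condensed into a small decision helper. This is a sketch under stated assumptions: the `Observations` fields and the `next_step` function are hypothetical names, the inputs are filled in by hand from the checks described in this article, and the ordering (fix the root cause before recovering an err-disabled port) is one reasonable reading of the steps rather than an official procedure.

```python
from dataclasses import dataclass


@dataclass
class Observations:
    config_matches: bool           # LACP mode and LAG settings agree on both ends
    physical_layer_ok: bool        # cables, optics, speed and duplex verified
    switch_port_errdisabled: bool  # switch reports the port as err-disabled
    dataplane_restarted: bool      # dataplane restart already tried on the edge


def next_step(obs: Observations) -> str:
    """Suggest the next troubleshooting action for a down fp-ethX LAG member."""
    if not obs.config_matches:
        return "align LACP/LAG configuration on the edge and the switch"
    if not obs.physical_layer_ok:
        return "check or swap cables and optics"
    if obs.switch_port_errdisabled:
        return "recover the err-disabled port on the switch"
    if not obs.dataplane_restarted:
        return "restart the dataplane service on the edge"
    return "collect logs from edge and switch and contact support"


if __name__ == "__main__":
    # Scenario from this article: settings look correct, switch port is err-disabled.
    print(next_step(Observations(True, True, True, False)))
```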

Additional Information

There are several knowledge base articles addressing common LACP issues on NSX edges. Review these articles for known issues that match your environment, such as issues related to: 

Physical links going down after uplink profile changes
Lag Member Down alarms reported in the NSX UI