Continuous BGP flaps on a virtual Edge node due to physical NIC CRC errors

Article ID: 442417

Products

VMware NSX

Issue/Introduction

  • You observe continuous BGP flaps on a virtual Edge node.
  • L2 connectivity failures occur between the Edge nodes, confirming a network-layer problem between the affected Edge node and all of its other connections.
  • When the Edge node is migrated with vMotion to another, stable ESXi host, the issue is resolved and no connectivity issues are observed on the affected Edge node(s).
  • Further investigation shows that the vmnics used by the affected Edge node on the problem ESXi host report a large number of CRC errors, and the count keeps incrementing over short intervals. This causes network instability for the VMs that use the affected vmnic(s) on that specific ESXi host.
  • To verify the CRC errors, run the following command several times and compare the counter between runs.
    • esxcli network nic stats get -n vmnic<#>
      Receive CRC errors: <####>   <----- This number increases between runs
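
The comparison described above can be sketched in Python. This is a minimal illustration, not a supported tool: it assumes you have saved the text output of two runs of the command, and the helper names (parse_crc_errors, crc_delta) and the sample numbers are hypothetical.

```python
import re

# Matches the "Receive CRC errors: <n>" line printed by
# `esxcli network nic stats get -n vmnic<#>`.
CRC_LINE = re.compile(r"^\s*Receive CRC errors:\s*(\d+)\s*$", re.MULTILINE)


def parse_crc_errors(stats_output: str) -> int:
    """Return the Receive CRC errors counter from one saved stats output."""
    match = CRC_LINE.search(stats_output)
    if match is None:
        raise ValueError("no 'Receive CRC errors' line found")
    return int(match.group(1))


def crc_delta(before: str, after: str) -> int:
    """Return how much the CRC counter grew between two samples."""
    return parse_crc_errors(after) - parse_crc_errors(before)


# Illustrative samples only; the vmnic name and all numbers are made up.
SAMPLE_T0 = """NIC statistics for vmnic2
   Packets received: 884213
   Receive CRC errors: 1520
"""
SAMPLE_T1 = """NIC statistics for vmnic2
   Packets received: 901877
   Receive CRC errors: 1987
"""

if __name__ == "__main__":
    growth = crc_delta(SAMPLE_T0, SAMPLE_T1)
    print(f"Receive CRC errors grew by {growth} between the two samples")
```

A positive delta between two closely spaced samples matches the symptom described above. A negative delta would suggest the counters were reset between samples (for example by a host reboot), so such a pair should be discarded.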

     

Environment

VMware NSX
VMware vSphere ESXi

Cause

  • The Edge node is utilizing a physical NIC (such as vmnicX) that is dropping packets due to CRC errors.
  • ESXi logs on the host show a large and steadily increasing number of CRC errors on that specific interface.
  • These errors directly cause L2 connectivity drops and subsequent BGP flapping.

Resolution

CRC stands for "Cyclic Redundancy Check". The FCS (Frame Check Sequence) field of an Ethernet frame contains a 4-byte CRC value used for error checking. When a source host assembles a frame, it runs a predetermined CRC algorithm over all fields in the frame except the Preamble, SFD (Start Frame Delimiter), and FCS. The source host stores the result in the FCS field and transmits it as part of the frame. When the destination host receives the frame, it runs the same algorithm again. If the CRC value calculated at the destination host does not match the value in the FCS field, the destination host discards the frame and counts it as a CRC error.
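
The check described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not how a NIC implements the check in hardware: it uses zlib.crc32, which implements the same CRC-32 polynomial that Ethernet uses, and it appends the value in little-endian byte order, which is a common software convention assumed here. The function names and the sample frame bytes are hypothetical.

```python
import struct
import zlib


def append_fcs(frame_without_fcs: bytes) -> bytes:
    """Sender side: compute a CRC-32 over the frame and append it as 4 bytes."""
    crc = zlib.crc32(frame_without_fcs) & 0xFFFFFFFF
    return frame_without_fcs + struct.pack("<I", crc)


def fcs_is_valid(frame_with_fcs: bytes) -> bool:
    """Receiver side: recompute the CRC and compare it with the trailing 4 bytes."""
    if len(frame_with_fcs) < 4:
        return False
    body, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    return struct.pack("<I", zlib.crc32(body) & 0xFFFFFFFF) == fcs


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate corruption on the wire by inverting a single bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Made-up frame: destination MAC, source MAC, EtherType, then payload.
FRAME = (
    bytes.fromhex("ffffffffffff")
    + bytes.fromhex("005056000001")
    + bytes.fromhex("0800")
    + b"example payload"
)

if __name__ == "__main__":
    sent = append_fcs(FRAME)
    print("intact frame passes the check:", fcs_is_valid(sent))
    print("corrupted frame passes the check:", fcs_is_valid(flip_bit(sent, 100)))
```

Because the receiver has no way to repair a frame that fails this comparison, the frame is dropped, which is why a faulty physical path shows up as lost traffic rather than as corrupted data inside the VM.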

To stabilize the BGP sessions and address the host hardware faults, follow the steps below:

  1. Migrate the Edge node: Migrate the Edge node to a different stable ESXi host. This moves the routing workload off the faulty hardware and immediately stabilizes the BGP sessions.

  2. Engage your hardware vendor: Work with your hardware vendor to troubleshoot the physical hardware and the CRC errors occurring on the NICs.

  3. Upgrade firmware and drivers: Upgrade the NIC drivers and firmware to the latest compatible combination. Ensure the ESXi host is completely stable before returning it to the cluster for production workloads.

Additional Information

Troubleshooting and understanding physical NIC receive or transmit dropped, missed and error counters in ESXi