BFD Session loss between VNF VMs on HPE Synergy Gen 10 after ESXi 7.0.3 Upgrade



Article ID: 435507


Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Following a scheduled upgrade from ESXi 6.7 to 7.0.3 (Telco bundle upgrade), you may experience a total loss of Bidirectional Forwarding Detection (BFD) sessions across Service Function (SF) VMs and multiple Control Function (CF) VMs.
If not properly addressed, this can result in a loss of resilient capacity for a very large number of users.

Symptoms:

  • Total BFD session failure preventing the reintroduction of services.

  • "Golden VM" scenario: isolated SF VMs remain functional while others in the same cluster fail.

  • High packet loss or dropped signalling traffic on the DI network (VNF internal control plane).

  • Impact: critical; many users affected.

Environment details:

  • Telco bundle: NFVI TCI 2.2.

  • Application: Cisco VPC-DI (StarOS 2024.03.g3 or newer) - MTX-AGW component

  • Driver: nmlx5-rdma (nVidia/Mellanox).

  • Hardware: HPE Synergy Gen 10.

Environment

VMware ESXi 7.0.3 (Upgraded from 6.7)

Cause

The upgrade to ESXi 7.0.3 re-installed the nmlx5-rdma VIB, which had previously been removed from the 6.7 environment. On HPE Synergy Gen 10 hardware with Mellanox ConnectX-5 adapters, this driver conflicts with VNF packet processing. The conflict disrupts the low-latency heartbeats that BFD requires, leading to session timeouts and the subsequent isolation of Service and Control traffic instances. Cisco does not require RDMA components to be installed on the ESXi infrastructure.
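A quick way to confirm whether the upgrade brought the conflicting VIB back is to filter the installed-VIB list on each host. The sketch below is illustrative only: on a real ESXi host it shells out to `esxcli software vib list`; elsewhere it falls back to sample lines (the version and date columns are placeholders, not real build numbers) so the filtering logic can be reviewed anywhere.

```shell
#!/bin/sh
# Minimal check: is the conflicting nmlx5-rdma VIB installed on this host?
vib_list() {
  if command -v esxcli >/dev/null 2>&1; then
    # On a real ESXi host, use the actual VIB inventory.
    esxcli software vib list
  else
    # Illustrative fallback output; <version> and <date> are placeholders.
    printf '%s\n' \
      'nmlx5-core  <version>  VMW  VMwareCertified  <date>' \
      'nmlx5-rdma  <version>  VMW  VMwareCertified  <date>'
  fi
}

if vib_list | grep -q nmlx5-rdma; then
  echo 'nmlx5-rdma present: host needs remediation'
else
  echo 'nmlx5-rdma absent'
fi
```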

Resolution

The nmlx5-rdma driver must be purged from the ESXi 7.0.3 hosts to allow BFD sessions to re-establish.

  1. Identify all ESXi hosts in the cluster where BFD sessions are failing.

  2. Evacuate or shut down the affected VNF VMs (CF/SF) and place the host in Maintenance Mode.

  3. Remove the offending RDMA VIB via the ESXi CLI: esxcli software vib remove --vibname=nmlx5-rdma

  4. Reboot the ESXi host.

  5. After the reboot, verify the driver is absent: esxcli software vib list | grep nmlx5-rdma

  6. Exit Maintenance Mode and power on the VNF VMs.

  7. Verify BFD recovery from the VNF CLI.
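Steps 3-5 above can be scripted per host. The following is a hedged sketch, not part of the official procedure: it defaults to a dry run that only prints each command so the sequence can be reviewed first; set DRY_RUN=0 on a host that is already in Maintenance Mode to actually execute it.

```shell
#!/bin/sh
# Sketch of steps 3-5 for one ESXi host already in Maintenance Mode.
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 on the
# host itself to execute.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Step 3: remove the conflicting RDMA VIB.
run esxcli software vib remove --vibname=nmlx5-rdma

# Step 4: reboot so the driver is fully unloaded.
run reboot

# Step 5 (after the reboot): grep finds nothing and returns non-zero
# when the VIB is absent, which is the desired state.
run sh -c 'esxcli software vib list | grep nmlx5-rdma'
```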