Following a scheduled upgrade from ESXi 6.7 to 7.0.3 (Telco bundle upgrade), you could experience a total loss of Bidirectional Forwarding Detection (BFD) sessions across Service Function (SF) VMs and multiple Control Function (CF) VMs.
If the problem is not addressed, this can result in a loss of resilient capacity for a very large number of users.
Symptoms:
Total BFD session failure preventing the reintroduction of services.
"Golden VM" scenario: Isolated SF VMs remain functional while others in the same cluster fail.
High packet loss or dropped signalling traffic on the DI-network (the VNF-internal control plane).
Impact: critical, many users affected.
Environment details:
Telco bundle: NFVI TCI 2.2.
Application: Cisco VPC-DI (StarOS 2024.03.g3 or newer) - MTX-AGW component
Driver: nmlx5-rdma (NVIDIA/Mellanox).
Hypervisor: VMware ESXi 7.0.3 (upgraded from 6.7).
Root cause: The upgrade to ESXi 7.0.3 re-installed the nmlx5-rdma VIB, which had previously been removed in the 6.7 environment. On HPE Synergy Gen 10 hardware with Mellanox ConnectX-5 adapters, this driver conflicts with the VNF's packet processing. The conflict disrupts the low-latency heartbeats that BFD requires, leading to session timeouts and the subsequent isolation of Service and Control traffic instances. Cisco does not explicitly require RDMA components on the ESXi infrastructure for this application.
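To confirm whether a host is exposed, a minimal check from the ESXi shell (assuming SSH/shell access to the host) is to look for the RDMA VIB and the Mellanox NIC driver:
esxcli software vib list | grep nmlx5
esxcli network nic list
An affected host typically lists both the nmlx5-core and nmlx5-rdma VIBs, and its vmnics report the nmlx5_core driver.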
Resolution: The nmlx5-rdma driver must be removed from the ESXi 7.0.3 hosts to allow the BFD sessions to re-establish.
Identify all ESXi hosts in the cluster where BFD sessions are failing.
Evacuate or shut down the affected VNF VMs (CF/SF) and place the host in Maintenance Mode.
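If the host is administered from the shell rather than vCenter, maintenance mode can be entered with the command below; note that esxcli does not evacuate VMs, so power them off or migrate them first:
esxcli system maintenanceMode set --enable true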
Remove the offending RDMA VIB via the ESXi CLI: esxcli software vib remove --vibname=nmlx5-rdma
Reboot the ESXi host.
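For a shell-driven workflow, the reboot can also be triggered with esxcli (shown here as one option; the reason text is only an example):
esxcli system shutdown reboot --reason "Remove nmlx5-rdma VIB"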
After the reboot, verify the driver is absent: esxcli software vib list | grep nmlx5-rdma
Exit Maintenance Mode and power on the VNF VMs.
Verify BFD recovery from the VNF CLI (example checks below).
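The exact verification commands depend on the StarOS release and configuration; the following are typical checks offered as examples rather than a definitive procedure:
show card table
show cloud monitor di-network summary
show bfd-protocol sessions
All cards should return to Active/Standby state, the DI-network monitor should show no heartbeat loss between cards, and the BFD sessions should report Up.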