vSAN Cluster might experience network instability issues with Cisco ACI and its Control Plane Learning feature enabled

Article ID: 385544

Products

VMware vSAN

Issue/Introduction

  • vSAN nodes reside in different, routed subnets (as is typical for vSAN Stretched Clusters).
  • vSAN Performance Service is enabled on the affected vSAN cluster.
  • The vSAN cluster experiences network issues between the routed subnets.
  • Cisco ACI is in use with its "Control Plane Learning" and "Rogue Endpoint Control" features enabled on the vSAN and Management networks.
  • Cisco ACI reports IP flapping for the vmkernel adapters that carry vSAN traffic (typically the alert "F3013: fltEpmRogueIpEpEpIPRogue" is raised).
  • This behavior is expected only in ESXi 8.0 Update 2 and later, because vSAN performance metrics are now sent via HTTPS (TCP/443) instead of HTTP (TCP/80) as in earlier releases.

Cause

When vSAN Performance Service is enabled, all vSAN nodes in the same cluster periodically send their latest metrics to the current vSAN master host. Under some circumstances, however, TCP RESET packets are sent to other vSAN nodes over the wrong physical network uplink while still carrying the source IP of the vmkernel adapter used by vSAN. (For example: vSAN's vmk3 is assigned only to vmnic3, yet these packets from vmk3 unexpectedly leave through vmnic0, which is used solely for Management.)

These rogue TCP packets cause Cisco ACI to "learn" the vSAN IPs on the wrong physical uplink. As a result, ACI can make unexpected routing decisions and may send vSAN data traffic along the wrong path. This can affect overall vSAN stability because of network connectivity issues between vSAN nodes across routed subnets.
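The condition described above can be illustrated with a minimal Python sketch. It assumes you already have a list of observed packets (for example, parsed from a packet capture) and know which uplinks each vmkernel IP is supposed to use. The `ObservedPacket` type, the IP addresses, and the uplink mapping are all hypothetical and not part of any VMware or Cisco tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedPacket:
    """One packet seen in a capture; all fields are illustrative."""
    src_ip: str    # source IP of the packet
    uplink: str    # physical uplink the packet left on, e.g. "vmnic0"
    tcp_rst: bool  # True if the TCP RST flag is set


def find_misrouted_resets(packets, expected_uplinks):
    """Return TCP RST packets whose source IP left on an unexpected uplink.

    expected_uplinks maps a vmkernel IP to the set of uplinks it may use.
    Packets from IPs that are not in the mapping are ignored.
    """
    return [
        p for p in packets
        if p.tcp_rst
        and p.src_ip in expected_uplinks
        and p.uplink not in expected_uplinks[p.src_ip]
    ]


# Hypothetical example: vmk3 (vSAN) owns 192.168.30.11 and is bound to vmnic3 only.
EXPECTED = {"192.168.30.11": {"vmnic3"}}

SAMPLE = [
    ObservedPacket("192.168.30.11", "vmnic3", tcp_rst=False),  # normal vSAN traffic
    ObservedPacket("192.168.30.11", "vmnic0", tcp_rst=True),   # RST on the Management uplink
    ObservedPacket("10.0.0.5", "vmnic0", tcp_rst=True),        # Management IP, not tracked
]

for packet in find_misrouted_resets(SAMPLE, EXPECTED):
    print(f"{packet.src_ip} sent a TCP RST via {packet.uplink}")
```

Only the second sample packet matches: it carries the vSAN source IP but left through the Management uplink, which is the pattern that leads Cisco ACI to learn the IP in the wrong place.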

Resolution

There is currently no resolution available.

The issue has been identified and will be addressed in a future release.

Workaround

  1. Disable "Control Plane Learning" for the affected networks, or
  2. Temporarily configure static routes between the vSAN subnets.
    (Note: This must also be applied before adding a vSAN node to an existing cluster!)
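
For the static-route option, the following Python sketch builds the per-host command string. It assumes the routes are added on each ESXi host with `esxcli network ip route ipv4 add`; the article itself does not prescribe a specific command. The function name, subnets, and gateway are hypothetical placeholders, so review the generated command against your own network design before running it.

```python
import ipaddress


def static_route_command(remote_network: str, gateway: str) -> str:
    """Build an esxcli command that adds an IPv4 static route.

    remote_network: the vSAN subnet of the other site, in CIDR notation.
    gateway: the next hop reachable from the local vSAN subnet.
    Raises ValueError if either value is malformed or if the gateway
    lies inside the remote network (it must be a local next hop).
    """
    # IPv4Network is strict by default: "10.20.30.5/24" is rejected.
    net = ipaddress.IPv4Network(remote_network)
    gw = ipaddress.IPv4Address(gateway)
    if gw in net:
        raise ValueError("gateway must not be inside the remote network")
    return f"esxcli network ip route ipv4 add --gateway {gw} --network {net}"


# Hypothetical values: local vSAN gateway 10.10.30.1, remote vSAN subnet 10.20.30.0/24.
print(static_route_command("10.20.30.0/24", "10.10.30.1"))
```

Validating the inputs before emitting the command avoids pushing a mistyped subnet to several hosts. The sketch only prints the command and does not execute anything.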