The host is encountering repeated LACP Down events.

Products

VMware vSphere ESXi

Issue/Introduction

/var/run/log/vmkernel.log shows that the LACP uplink vmnic# flapped, with a link DOWN event followed by a link UP event:

<DATE_TIME> cpu43:2098552)Team.vswitch: TeamVSLACPLAGEventCB:9083: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Received event UPLINK LINK STATUS, LAG /945396023, link UNKNOWN, uplink vmnic#/0x86000017, link DOWN
<DATE_TIME> cpu43:2098552)Team.vswitch: TeamVSLACPLAGEventCB:9083: [nsx@6876 comp="nsx-esx" subcomp="vswitch"]Received event UPLINK LINK STATUS, LAG /945396023, link UNKNOWN, uplink vmnic#/0x86000017, link UP
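
To gauge how often an uplink flaps, log lines like the ones above can be tallied per uplink and link state. The following is a minimal Python sketch, not a VMware tool. The regular expression is derived only from the two sample lines shown here, and count_lacp_events is a hypothetical helper name. Other ESXi or NSX builds may format these lines differently.

```python
import re
from collections import Counter

# Pattern based on the sample TeamVSLACPLAGEventCB lines in this article.
# It captures the uplink name (text before the "/") and the final link state.
EVENT_RE = re.compile(
    r"TeamVSLACPLAGEventCB.*UPLINK LINK STATUS.*"
    r"uplink (?P<uplink>[^/\s]+)/\S+, link (?P<state>UP|DOWN)"
)


def count_lacp_events(lines):
    """Return a Counter keyed by (uplink, state) for LACP uplink link events."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("uplink"), match.group("state"))] += 1
    return counts


# Example usage (the path assumes a copy of the log is readable locally):
# with open("vmkernel.log", errors="replace") as fh:
#     for (uplink, state), n in sorted(count_lacp_events(fh).items()):
#         print(uplink, state, n)
```

A large DOWN count for one uplink over a short period suggests that the uplink is the one losing LACPDUs.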

Packet captures show that an LACPDU sent by the physical switch is not received on the ESXi host.

LACPDUs should appear every second in the captures on both the physical switch and the ESXi host, but one LACPDU is missing from the ESXi host capture.

# Captures on the physical switch

<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA

# Captures on the ESXi host

<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
<DATE> <TIME> <Source> Slow-Protocols LACP 124 v1 ACTOR <MAC_ADDRESS> P: 3 K: 102 **DCSGSA PARTNER <MAC_ADDRESS> P: 4 K: 79 **DCSGSA
--> Missing
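
Missing LACPDUs can also be located by checking the spacing between packet timestamps in the ESXi host capture. The sketch below is an illustration under stated assumptions: the timestamps have already been extracted as seconds (for example from an exported capture), the expected interval is the one second mentioned above, and find_lacpdu_gaps and its tolerance parameter are hypothetical names, not part of any capture tool.

```python
def find_lacpdu_gaps(timestamps, expected_interval=1.0, tolerance=0.5):
    """Return (previous, current, estimated_missed) for each oversized gap.

    timestamps: LACPDU arrival times in seconds, in ascending order.
    expected_interval: nominal spacing between LACPDUs, in seconds.
    tolerance: extra delay allowed before a gap counts as a loss.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > expected_interval + tolerance:
            # Rough estimate of how many LACPDUs fell into this gap.
            missed = round(delta / expected_interval) - 1
            gaps.append((prev, cur, missed))
    return gaps


# Example usage with made-up arrival times (seconds since capture start):
# find_lacpdu_gaps([0.0, 1.0, 2.0, 4.0])
```

Running the same check on the physical switch capture helps confirm that the switch sent the LACPDU and that the loss occurred on the way to, or on, the host.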

In addition, /commands/nicinfo.sh.txt in the ESXi support bundle shows high txBusy counts on several TX queues of the uplink:

NIC: vmnic#
vmnic# 0000:5e:00.0 i40en Up Up 10000 Full <MAC_ADDRESS> 9100 Intel(R) Ethernet Controller X710 for 10GbE SFP+

NIC Private statistics:
Number of packets assigned to an invalid queue: 0

...
txq0: totalPkts=1821079920 totalBytes=678993173391 restartQueue=14222255 txBusy=14219716 queueFull=14219716 pktDropped=0
txq1: totalPkts=1948947390 totalBytes=1195888523074 restartQueue=60998 txBusy=60948 queueFull=60948 pktDropped=0
txq2: totalPkts=10394929 totalBytes=9960475523 restartQueue=22397 txBusy=22382 queueFull=22382 pktDropped=0
txq3: totalPkts=6520813 totalBytes=1164502036 restartQueue=15067 txBusy=15049 queueFull=15049 pktDropped=0
txq4: totalPkts=3025390 totalBytes=759232782 restartQueue=6175 txBusy=6168 queueFull=6168 pktDropped=0
txq5: totalPkts=791696 totalBytes=201090078 restartQueue=101 txBusy=99 queueFull=99 pktDropped=0
txq6: totalPkts=142384 totalBytes=27710611 restartQueue=4 txBusy=4 queueFull=4 pktDropped=0
...
txq23: totalPkts=406627557 totalBytes=58554368208 restartQueue=0 txBusy=0 queueFull=0 pktDropped=0
...
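
To compare queues quickly, the per-queue counters above can be parsed and ranked by txBusy. This is a minimal sketch that assumes the txqN: key=value layout shown in this excerpt. parse_txq_stats and busy_ratio are hypothetical helper names, and the ratio of txBusy to totalPkts is only an informal indicator, not a documented threshold.

```python
import re

# Matches lines such as "txq0: totalPkts=... txBusy=... pktDropped=0".
TXQ_RE = re.compile(r"^(txq\d+):\s+(.*)$")


def parse_txq_stats(text):
    """Parse txq lines into {queue_name: {counter_name: int_value}}."""
    stats = {}
    for line in text.splitlines():
        match = TXQ_RE.match(line.strip())
        if not match:
            continue
        pairs = (item.split("=", 1) for item in match.group(2).split())
        stats[match.group(1)] = {key: int(value) for key, value in pairs}
    return stats


def busy_ratio(stats):
    """Return txBusy / totalPkts per queue (0.0 when no packets were sent)."""
    return {
        queue: (c["txBusy"] / c["totalPkts"] if c["totalPkts"] else 0.0)
        for queue, c in stats.items()
    }


# Example usage (path is relative to an extracted support bundle):
# with open("commands/nicinfo.sh.txt", errors="replace") as fh:
#     stats = parse_txq_stats(fh.read())
# for queue, c in sorted(stats.items(), key=lambda kv: -kv[1]["txBusy"]):
#     print(queue, c["txBusy"], c["queueFull"])
```

Because the file contains statistics for every NIC, the input may need to be limited to the section for the affected vmnic before parsing.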

Environment

VMware vSphere ESXi

Cause

This issue can occur when bursts of I/O cause packet loss, which can include LACPDUs.

Resolution

To mitigate the issue, consider the following two options:

Increase both RX and TX ring buffer sizes on the pNICs, following the guidance in KB 341594.
Enable Ethernet flow control (Pause Frames), following the guidance in KB 324551.