Network adapter (vmnic) is down or fails with a Failed Criteria Code

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information on troubleshooting issues when a network adapter fails.
The ESXi/vCenter UI and the ESXi logs show NIC adapter alerts and messages, such as the 'Network uplink redundancy degraded', 'Network uplink redundancy lost', or 'Lost Network Redundancy' alarm.
This article covers the typical checks that can be performed when troubleshooting.

Impact/Risks:

Packet flow for the services that use the affected portgroup (either a standard switch portgroup or a distributed virtual switch (DVS) portgroup) will cease along the data path associated with the named physical uplink (vmnic). This will impact one or more of the following:

Virtual machines
VMkernel network adapters (including services like Management, vMotion, vSAN, iSCSI, NFS, and NSX)
Physical NIC interfaces

The trigger for an event such as "Up" or "Down" is typically an external event upstream from the physical NIC.
The first step is to ask the team that manages the physical infrastructure external to the affected ESXi host to check the logs of the physical switch, and of the switchport to which the vmnic is connected, for possible reasons for the event.
In some instances, an uplink changes state and goes down and back up within a very short time (under a second); this is known as micro-flapping.
Regardless, the ESXi host logs will provide the opportunity for timeline analysis of the events.
It is important to note that logs never reveal causes; they only reveal effects. However, possible root causes can be investigated once there is a clear understanding of which events the ESXi host experienced, and when. For more information on how to collect logs, see Collecting diagnostic information for VMware ESXi.

Environment

VMware vSphere ESXi
VMware vCenter Server

Cause

In some cases, a vmnic can fail because of device firmware and/or device driver issues.

Insight into this can typically be found in /var/run/log/vmkernel.log and its rotated historical copies.
For further information on device firmware and drivers, see FAQ: Recommendation for drivers/firmware.

If there is no obvious log event that suggests a device driver or firmware issue, the next step is to ask the team that manages the physical infrastructure external to the affected ESXi host to check the logs of the physical switch, and of the switchport to which the vmnic is connected, for possible reasons for the event.

If that team does not find anything in their logs, request a timeline analysis from Broadcom Support.

The logs required are the ESXi host logs, per Collecting diagnostic information for VMware ESXi.
After the logs are collected, open a Support case, per Creating and managing Broadcom support cases.
Attach the collected logs to the case, per Uploading files to cases on the Broadcom Support Portal.

In addition to the logs outlined above, useful information to include with the Problem Statement when opening a case would be:

The name of the ESXi host(s) where the symptoms are observed.
The most recent date, time, and time zone when the symptoms were not observed, prior to the onset of the symptoms to be investigated.
Actions taken and the corresponding timestamps following the observation of the symptoms.
The current state of the host (example: Maintenance Mode, or Not Responding when viewed from vCenter, etc.)

Resolution

To determine the cause of the failure or eliminate common NIC issues:

1. Check the current status of the vmnic from either the VMware vSphere Client or the command line (via SSH or a KVM server console).
- To check the status from the vSphere Client:
  1. Select the ESXi host and click the Configuration tab.
  2. Click Networking.
  3. The vmnics currently assigned to virtual switches are displayed in the diagrams. If a vmnic displays a red X, that link is currently down.
- To check the status from the command line via SSH or a KVM server console, run the following command:
  
  esxcli network nic list
  
  The output appears similar to this:
```
Name     PCI Device    Driver  Admin Status  Link Status   Speed  Duplex  MAC Address         MTU  Description
-------  ------------  ------  ------------  -----------  ------  ------  -----------------  ----  -----------
vmnic0   0000:01:00.0  ixgben  Up            Up             1000  Full    ec:f4:##:##:##:##  1500  Intel(R) Ethernet Controller X540-AT2
vmnic1   0000:01:00.1  ixgben  Up            Up             1000  Full    ec:f4:##:##:##:##  1500  Intel(R) Ethernet Controller X540-AT2
```
  Note: The Admin Status is the only portion of the output that ESXi controls. The status can be changed by using the following commands:

esxcli network nic down -n vmnic#
esxcli network nic up -n vmnic#

The Link Status column specifies the status of the link between the physical network adapter and the physical switch.

The status can be either Up or Down. If there are several network adapters, with some being up and some down, then verify if they are intended to be connected. In many environments, vmnics are installed, but not connected by design.
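
When many hosts or adapters are involved, the same check can be scripted against captured command output. The following is a minimal sketch, not an official tool: it assumes the output of `esxcli network nic list` has been saved as text, that adapter names start with `vmnic`, and that the first nine columns never contain spaces. The function names are hypothetical, and the second sample row has been altered to show a down link for illustration.

```python
# Sample text based on the example output in this article. The vmnic1 row is
# a hypothetical modification (Link Status "Down") made for illustration.
SAMPLE = """\
Name     PCI Device    Driver  Admin Status  Link Status   Speed  Duplex  MAC Address         MTU  Description
-------  ------------  ------  ------------  -----------  ------  ------  -----------------  ----  -----------
vmnic0   0000:01:00.0  ixgben  Up            Up             1000  Full    ec:f4:##:##:##:##  1500  Intel(R) Ethernet Controller X540-AT2
vmnic1   0000:01:00.1  ixgben  Up            Down              0  Half    ec:f4:##:##:##:##  1500  Intel(R) Ethernet Controller X540-AT2
"""

# Hypothetical field labels, one per column of the listing.
FIELDS = ("name", "pci", "driver", "admin_status", "link_status",
          "speed", "duplex", "mac", "mtu", "description")


def parse_nic_list(text):
    """Return one dict per vmnic row; header and separator rows are skipped."""
    nics = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("vmnic"):
            continue
        # The first nine columns are single tokens; the description is the
        # remainder of the line and may contain spaces.
        parts = line.split(None, len(FIELDS) - 1)
        if len(parts) == len(FIELDS):
            nics.append(dict(zip(FIELDS, parts)))
    return nics


def links_needing_attention(nics):
    """Names of adapters that are administratively Up but report link Down."""
    return [n["name"] for n in nics
            if n["admin_status"] == "Up" and n["link_status"] == "Down"]
```

An adapter returned by `links_needing_attention` still needs to be compared against the intended cabling, because a vmnic may be installed but left unconnected by design.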

2. Check that the vmnic referred to in the event message is still connected to the switch and configured properly:

Make sure that the network cable is still connected to the switch and to the host.
Check that the switch connected to the system is still functioning properly and has not been misconfigured. Refer to the switch documentation for details.
Check for activity between the physical switch and the vmnic. This might be indicated either by a network trace or activity LEDs.
Check that the NIC driver is up to date: Determining Network/Storage firmware and driver version in ESXi.

3. Search for the word "vmnic" in the `/var/run/log/vobd.log` file.

If "vmnic down" or "vmnic up" messages are observed, it may indicate that the NIC is flapping.

Note: Some NICs report only the link up state, not the down state. If the NIC is reported as "up" and the host had not been rebooted, this indicates that the NIC is flapping and not reporting the down state to ESXi.
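
Once the up and down events have been extracted from the log, the two patterns described above can be counted mechanically. The sketch below is illustrative only: it assumes the events are already available as `(timestamp, uplink, state)` tuples (a layout chosen for this example, not something ESXi produces), and the one-second window and the function name are hypothetical choices.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def summarize_transitions(events, window=timedelta(seconds=1)):
    """Summarize uplink state changes per vmnic.

    events: iterable of (timestamp, uplink, state) tuples, where timestamp is
    a datetime and state is "up" or "down" (assumed layout for this sketch).

    For each uplink that has at least one "up" event, the result reports:
      fast_recoveries  - down followed by up within `window`
      up_without_down  - up with no preceding down in the supplied events
    """
    last_down = {}
    summary = defaultdict(lambda: {"fast_recoveries": 0, "up_without_down": 0})
    for ts, uplink, state in sorted(events):
        if state == "down":
            last_down[uplink] = ts
        elif state == "up":
            down_ts = last_down.pop(uplink, None)
            if down_ts is None:
                # Also counts the first "up" after a host boot, so compare
                # these entries against known reboot times before acting.
                summary[uplink]["up_without_down"] += 1
            elif ts - down_ts <= window:
                summary[uplink]["fast_recoveries"] += 1
    return dict(summary)
```

A non-zero `up_without_down` count on a host that was not rebooted corresponds to the case in the note above, where the NIC reports only the up state.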

Timestamps suffixed with the letter "Z" (as shown in the example below) are in UTC (Coordinated Universal Time). A reliable time zone converter can be used to translate the UTC time to the equivalent local time.
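
The conversion can also be done with a few lines of Python. This is a minimal sketch that assumes the timestamp has the `YYYY-MM-DDThh:mm:ss.fffZ` layout used in this article and that the local offset from UTC is known. The function name and the sample values are hypothetical, and a fixed offset does not account for daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def vobd_utc_to_local(stamp, utc_offset_hours):
    """Convert a 'Z'-suffixed UTC timestamp to a fixed-offset local time."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    utc = utc.replace(tzinfo=timezone.utc)
    return utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))


# Hypothetical timestamp, converted to a UTC-5 local time.
local = vobd_utc_to_local("2024-03-01T14:05:09.330Z", -5)
print(local.isoformat())  # 2024-03-01T09:05:09.330000-05:00
```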

Check whether the vmnic messages include a failed criteria code. If a failed criteria code is listed, see step 4 below.

If there is no failed criteria code and everything in step 2 above has been checked, open a case with the hardware vendor and have them investigate.

4. In the `/var/run/log/vobd.log` file, the vmnic failure may be classified with a Failed criteria code. This code explains the reason for the vmnic failure.

Example:

YYYY-MM-DDThh:mm:ss.330Z: [netCorrelator] 4836107000843us: [vob.net.dvport.uplink.transition.down] Uplink: vmnic4 is down. Affected dvPort: ##/50 24 e2 d9 41 e2 48 58-## ## ## ## ## ## ## ##. 3 uplinks up. Failed criteria: 128

Time - Event - Uplink# - State - Port - vSwitch - # Active Uplinks left - Failed Criteria

Note: # Active Uplinks left is an indication of a failover, which identifies the number of active uplinks left in the teaming policy of the virtual switch.
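
For timeline analysis across many entries, the fields described above can be pulled out with a regular expression. The pattern below is a sketch written against the single example line in this article; it only handles "is down" entries, the group names are hypothetical labels, and other ESXi builds or event types may format their entries differently.

```python
import re

# Example line copied from this article (placeholders left as-is).
EXAMPLE = (
    "YYYY-MM-DDThh:mm:ss.330Z: [netCorrelator] 4836107000843us: "
    "[vob.net.dvport.uplink.transition.down] Uplink: vmnic4 is down. "
    "Affected dvPort: ##/50 24 e2 d9 41 e2 48 58-## ## ## ## ## ## ## ##. "
    "3 uplinks up. Failed criteria: 128"
)

# Group names are labels chosen for this sketch.
LINE_RE = re.compile(
    r"^(?P<time>\S+): \[netCorrelator\] \d+us: "
    r"\[(?P<event>vob\.net\.[\w.]+)\] "
    r"Uplink: (?P<uplink>vmnic\S+) is down\. "
    r"Affected (?P<kind>dvPort|portgroup): (?P<target>.+?)\. "
    r"(?P<uplinks_up>\d+) uplinks up\. "
    r"Failed criteria: (?P<criteria>\d+)"
)


def parse_uplink_down(line):
    """Return the fields of an uplink-down entry, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["uplinks_up"] = int(fields["uplinks_up"])
    fields["criteria"] = int(fields["criteria"])
    return fields
```

Calling `parse_uplink_down(EXAMPLE)` yields `vmnic4` as the uplink, `3` as the number of uplinks still up, and `128` as the failed criteria code.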

The following are the failed criteria codes:

1 – Link speed reported by the driver (exact match for compliance)
2 – Link speed reported by the driver (equal or greater for compliance)
4 – Link duplex reported by the driver
8 – Link LACP state down
32 – Beacon probing
64 – Errors reported by the driver or hardware
128 – Link state reported by the driver
256 – The port is blocked
512 – The driver has registered the device

Note: Failed criteria 128 means the driver is reporting a link state of down. This can be caused by unplugging the network cable or administratively shutting down the physical switchport. If the link outage was not intended, the likely cause is an issue with the driver, firmware, SFP+ module, cable, and/or switchport of the physical switch. When failed criteria 128 entries are seen in the vobd log, check the driver as described in Determining Network/Storage firmware and driver version in ESXi, and contact the host hardware vendor for further troubleshooting.

Additional Information

The criteria that are used to determine if a network adapter in a network adapter team has failed include:

checkBeacon – By default, this check is disabled. This check becomes active when Beacon Probing is enabled on a virtual switch.
checkDuplex – By default, this check is disabled.
- If checkDuplex is true, the configured duplex mode is fullDuplex, and the link is considered to be bad if the link duplex reported by the driver is not the same as fullDuplex.
- If checkDuplex is false, fullDuplex is unused and the link duplex is not used as a detection method.
checkErrorPercent – By default, this check is disabled.
- If checkErrorPercent is true, the percentage mentioned in the criteria is the configured error percentage that is tolerated. The link is considered to be bad if the error rate exceeds the percentage.
- If checkErrorPercent is false, the percentage is unused, and the error percentage is not used as a detection method.
checkSpeed – The default setting is minimum, with a default speed of 10 Mbps.

checkSpeed must be set to one of these values:
- exact – Use exact speed to detect link failure. Speed is the configured exact speed in megabits per second.
- minimum – Use minimum speed to detect failure. Speed is the configured minimum speed in megabits per second.
- empty string – Do not use link speed to detect failure. Speed is unused in this case.

The Failed criteria code of 32 indicates that the link has failed because Beacon Probing detected a problem. Beacon Probing sends beacons per VLAN between the physical NICs in a team. When the other NICs do not receive these beacons, there is a problem in the physical network.

Note: The failed criteria codes are cumulative: when multiple criteria are met, their codes are added together.

When there are multiple failures, entries similar to these are seen in the /var/run/log/vobd.log file:

YYYY-MM-DDThh:mm:ss.449Z: [netCorrelator] 1123644995238us: [vob.net.pg.uplink.transition.down] Uplink: vmnic# is down. Affected portgroup: ########. 0 uplinks up. Failed criteria: 130

The failed criteria code here is 130, which is 2 + 128. It is a combination of these two failed criteria codes:
2 – Link speed reported by the driver (equal or greater for compliance)
128 – Link state reported by the driver
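
Because each listed code is a distinct power of two, a combined value can be split back into its parts by testing individual bits. The following sketch shows the arithmetic using the codes listed in this article; the function name is hypothetical, and any bit that is not in the list above is reported as unlisted rather than guessed.

```python
# Failed criteria codes as listed in this article.
FAILED_CRITERIA = {
    1: "Link speed reported by the driver (exact match for compliance)",
    2: "Link speed reported by the driver (equal or greater for compliance)",
    4: "Link duplex reported by the driver",
    8: "Link LACP state down",
    32: "Beacon probing",
    64: "Errors reported by the driver or hardware",
    128: "Link state reported by the driver",
    256: "The port is blocked",
    512: "The driver has registered the device",
}


def decode_failed_criteria(code):
    """Split a cumulative failed criteria code into (value, description) pairs."""
    reasons = []
    remaining = code
    for bit, text in sorted(FAILED_CRITERIA.items()):
        if code & bit:
            reasons.append((bit, text))
            remaining &= ~bit
    if remaining:
        # Bits with no entry in the table above are reported, not interpreted.
        reasons.append((remaining, "not listed in this article"))
    return reasons


for value, text in decode_failed_criteria(130):
    print(value, "-", text)
```

For the value 130 the loop prints the entries for 2 and 128, matching the manual breakdown above.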