Error on vSAN Skyline Health check - vMotion: MTU check (ping with large packet size)

Article ID: 394719


Products

VMware vSAN

Issue/Introduction

Introduction:

  • The Maximum Transmission Unit (MTU) health check, also called "MTU check (ping with large packet size)", complements the basic connectivity check for vMotion traffic.

  • The MTU check warning can be caused by an MTU mismatch between the vSphere environment and the physical switch.

    • Example: When the vmknic is configured with an MTU of 9000 but the physical switch enforces 1500, packets that the source does not fragment are dropped at the physical switch, as illustrated in the sketch below.
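
    • Illustration (a sketch, not output from this environment; vmk1 and the remote address are placeholders): with the Don't Fragment flag set, a jumbo-sized ping fails across a path limited to 1500 bytes while a standard-sized ping succeeds.

      • # 8972-byte payload + 20-byte IP header + 8-byte ICMP header = 9000 bytes on the wire
        vmkping -I vmk1 -d -s 8972 <remote_vMotion_IP>   # dropped if any hop enforces MTU 1500
        # 1472 + 28 = 1500 bytes on the wire
        vmkping -I vmk1 -d -s 1472 <remote_vMotion_IP>   # succeeds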

Symptoms:

  • The below error is noticed in Skyline Health.

    • Cluster > Monitor > Skyline Health >

  • vMotion tasks take longer than expected or fail.

Error shown in Skyline Health:

Verification:

  1. Identify the VMkernel interface (vmk) and vmnic used for vMotion traffic and the MTU size configured on each.

    • Make sure the MTU is consistently configured across the cluster.

    • Identify the vMotion network on the ESXi host.

      • [root@esxi2:~] esxcfg-vmknic -l | grep vMotion
        vmk1       vMotion                                 IPv4      ###.###.##.###                           255.255.255.0   ###.###.#.###  00:50:56:##:##:## 1500    65535     true    STATIC              defaultTcpipStack
        vmk1       vMotion                                 IPv6      fe80::250:####:####:####                64                              00:50:56:##:##:## 1500    65535     true    STATIC, PREFERRED   defaultTcpipStack
      • The above output indicates that vmk1 is used for vMotion traffic.

    • Check the MTU setting on the vSwitch.

      • [root@esxi2:~] esxcfg-vswitch -l

        Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
        vSwitch0         2520        8           128               1500    vmnic0,vmnic1

          PortGroup Name                            VLAN ID  Used Ports  Uplinks
          VM Network                                0        0           vmnic0,vmnic1
          VMkernel-Test                             0        1           vmnic0,vmnic1
          vMotion                                   0        1           vmnic1
          Management Network                        0        1           vmnic0,vmnic1

    • Check the MTU on the vmnics:

      • [root@esxi2:~] esxcfg-nics -l
        Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
        vmnic0  0000:0b:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic1  0000:13:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic2  0000:1b:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic3  0000:04:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
    • The above output indicates that vmnic1 is the uplink carrying the vmk1 vMotion traffic and is set to an MTU of 1500. Equivalent esxcli commands are sketched below.
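
    • The same information can be gathered with esxcli. The commands below are a sketch (vSwitch0, vmk1, and the value 9000 are examples, not taken from this article) that lists the configured MTUs and, if jumbo frames are intended end to end, shows how they could be aligned; the physical switch ports must be configured to match as well.

      • esxcli network ip interface list          # MTU of each VMkernel interface (vmk#)
        esxcli network vswitch standard list      # MTU of each standard vSwitch
        esxcli network nic list                   # MTU of each physical NIC (vmnic#)

        # Example only: raise the MTU on the vSwitch first, then on the VMkernel interface
        esxcli network vswitch standard set -v vSwitch0 -m 9000
        esxcli network ip interface set -i vmk1 -m 9000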

  2. Check for packet drops by running vmkping on the vMotion network with a 1472-byte payload.

    • vmkping -I <vmotion_vmk> <vMotion_IP_of_another_host> -d -s 1472 -c 300

      • Example: vmkping -I vmk1 ###.###.#.### -d -s 1472 -c 100

      • PING ###.###.#.### (###.###.#.###): 1472 data bytes
        1480 bytes from ###.###.#.###: icmp_seq=0 ttl=64 time=1.517 ms
        1480 bytes from ###.###.#.###: icmp_seq=2 ttl=64 time=0.508 ms
        1480 bytes from ###.###.#.###: icmp_seq=3 ttl=64 time=0.634 ms

        --- ###.###.#.### ping statistics ---
        100 packets transmitted, 25 packets received, 75% packet loss
        round-trip min/avg/max = 0.487/0.685/1.517 ms

      • Notice the missing sequence numbers and the packet loss percentage; a per-host loop variant of this test is sketched below.
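
      • The 1472-byte payload corresponds to the standard 1500-byte MTU minus 28 bytes of headers (20-byte IP header + 8-byte ICMP header); on a jumbo-frame vMotion network the equivalent payload would be 8972 bytes.

      • When the cluster has several hosts, the same test can be repeated against each peer. The loop below is a minimal sketch (vmk1 and the addresses are placeholders) that prints only the statistics line for each target:

        # Substitute the vMotion IPs of the other hosts in the cluster
        for ip in 192.0.2.11 192.0.2.12 192.0.2.13; do
            echo "== $ip =="
            vmkping -I vmk1 -d -s 1472 -c 20 $ip | grep "packet loss"
        done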

 

Environment

VMware ESXi Version: 7.x
VMware ESXi Version: 8.x

VMware vSAN: 7.x
VMware vSAN: 8.x

Cause

A CRC (Cyclic Redundancy Check) error occurs when data corruption is detected during transmission over the physical network: the calculated checksum of the received data does not match the expected value, typically because of issues such as a faulty cable, SFP, Network Interface Card (NIC), or SAN switch.

  • CRC errors are noticed on the vMotion uplink network card.
    • Run the following command against the vmnic used for the vMotion uplink.

      • $ esxcli network nic stats get -n <vmnic#>
        NIC statistics for vmnic1
        Packets received: 5721419054
        Packets sent: 6897046642
        Bytes received: 2845905140057
        Bytes sent: 4960832527165
        Receive packets dropped: 0
        Transmit packets dropped: 0
        Multicast packets received: 133976174
        Broadcast packets received: 49838376
        Multicast packets sent: 390456
        Broadcast packets sent: 39197
        Total receive errors: 4028
        Receive length errors: 36
        Receive over errors: 0
        Receive CRC errors: 1996
        Receive frame errors: 0
    • CRC errors are captured under /var/run/log/hostd.log; a search sketch follows the log excerpt below.

      • DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > errorsRx: 178317
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxCRCErrors: 88158
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- >
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > errorsRx: 178313
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxCRCErrors: 88156
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- >
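
    • One way to check whether these entries are still being written (a sketch; adjust the filter as needed) is to search the log directly:

      • # Count CRC-related lines recorded so far
        grep -c "RxCRCErrors" /var/run/log/hostd.log
        # Show the most recent error-stats entries
        grep "Error stats for pnic" /var/run/log/hostd.log | tail -n 5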

  • The watch command refreshes the output at a regular interval; with the command below, you can monitor whether the CRC error count is increasing (an equivalent shell loop is sketched after it).

$ watch esxcli network nic stats get -n <vmnic#>
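
  • As an alternative to watch, a plain shell loop gives a similar rolling view of the CRC counters. This is a sketch; vmnic1 and the 10-second interval are examples:

    while true; do
        date
        esxcli network nic stats get -n vmnic1 | grep -i "CRC"
        sleep 10
    done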

Resolution

  • This issue lies outside the vSphere environment; involve the server hardware vendor and the SAN switch vendor to replace the defective hardware.

  • Note: The CRC error count will stop increasing once the defective hardware is replaced. A reboot of the ESXi server will reset the CRC error counters.

  • Once the issue has been resolved, rerun the vSAN Health Check tests to confirm that the MTU Check (ping with large packet size) warning is no longer present.

Additional Information