Error on vSAN Skyline Health check - vMotion: MTU check (ping with large packet size)

Article ID: 394719


Products

VMware vSAN

Issue/Introduction

Introduction:

  • The Maximum Transmission Unit (MTU) health check, also called "MTU check (ping with large packet size)", complements the basic connectivity check for vMotion traffic.

  • The MTU check warning can be caused by an MTU mismatch between the vSphere environment and the physical switch.

    • Example: When the vmknic is configured with an MTU of 9000 but the physical switch enforces 1500, packets that the source does not fragment are dropped at the physical switch, as illustrated in the sketch below.
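
    • Illustration (a sketch, not output from this environment; vmk1 and the remote address are placeholders): with the Don't Fragment flag set, a jumbo-sized ping fails across a path limited to 1500 bytes while a standard-sized ping succeeds.

      • # 8972-byte payload + 20-byte IP header + 8-byte ICMP header = 9000 bytes on the wire
        vmkping -I vmk1 -d -s 8972 <remote_vMotion_IP>   # dropped if any hop enforces MTU 1500
        # 1472 + 28 = 1500 bytes on the wire
        vmkping -I vmk1 -d -s 1472 <remote_vMotion_IP>   # succeeds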

Symptoms:

  • The below error is noticed in Skyline Health.

    • Cluster > Monitor > Skyline Health >

  • vMotion tasks take longer than expected or fail.

Error shown in Skyline Health:

Verification:

  1. Identify the VMkernel interface (vmk) and vmnic used for vMotion traffic and the MTU size configured on each.

    • Make sure the MTU is consistently configured across the cluster.

    • Identify the vMotion network on the ESXi host.

      • [root@esxi2:~] esxcfg-vmknic -l | grep vMotion
        vmk1       vMotion                                 IPv4      ###.###.##.###                           255.255.255.0   ###.###.#.###  00:50:56:##:##:## 1500    65535     true    STATIC              defaultTcpipStack
        vmk1       vMotion                                 IPv6      fe80::250:####:####:####                64                              00:50:56:##:##:## 1500    65535     true    STATIC, PREFERRED   defaultTcpipStack
      • The above output indicates that vmk1 is used for vMotion traffic.

    • Check the MTU setting on the vSwitch.

      • [root@esxi2:~] esxcfg-vswitch -l

        Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
        vSwitch0         2520        8           128               1500    vmnic0,vmnic1

          PortGroup Name                            VLAN ID  Used Ports  Uplinks
          VM Network                                0        0           vmnic0,vmnic1
          VMkernel-Test                             0        1           vmnic0,vmnic1
          vMotion                                   0        1           vmnic1
          Management Network                        0        1           vmnic0,vmnic1

    • Check the MTU on the vmnics:

      • [root@esxi2:~] esxcfg-nics -l
        Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
        vmnic0  0000:0b:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic1  0000:13:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic2  0000:1b:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
        vmnic3  0000:04:00.0 nvmxnet3    Up   10000Mbps  Full   00:50:56:##:##:## 1500   VMware Inc. vmxnet3 Virtual Ethernet Controller
    • The above output indicates that vmnic1 is the uplink carrying the vmk1 vMotion traffic and is set to an MTU of 1500. Equivalent esxcli commands are sketched below.
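
    • The same information can be gathered with esxcli. The commands below are a sketch (vSwitch0, vmk1, and the value 9000 are examples, not taken from this article) that lists the configured MTUs and, if jumbo frames are intended end to end, shows how they could be aligned; the physical switch ports must be configured to match as well.

      • esxcli network ip interface list          # MTU of each VMkernel interface (vmk#)
        esxcli network vswitch standard list      # MTU of each standard vSwitch
        esxcli network nic list                   # MTU of each physical NIC (vmnic#)

        # Example only: raise the MTU on the vSwitch first, then on the VMkernel interface
        esxcli network vswitch standard set -v vSwitch0 -m 9000
        esxcli network ip interface set -i vmk1 -m 9000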

  2. Check for packet drops by running vmkping on the vMotion network with a 1472-byte payload.

    • vmkping -I <vmotion_vmk> <vMotion_IP_of_another_host> -d -s 1472 -c 300

      • Example: vmkping -I vmk1 ###.###.#.### -d -s 1472 -c 100

      • PING ###.###.#.### (###.###.#.###): 1472 data bytes
        1480 bytes from ###.###.#.###: icmp_seq=0 ttl=64 time=1.517 ms
        1480 bytes from ###.###.#.###: icmp_seq=2 ttl=64 time=0.508 ms
        1480 bytes from ###.###.#.###: icmp_seq=3 ttl=64 time=0.634 ms

        --- ###.###.#.### ping statistics ---
        100 packets transmitted, 25 packets received, 75% packet loss
        round-trip min/avg/max = 0.487/0.685/1.517 ms

      • Notice the missing sequence numbers and the packet loss percentage; a per-host loop variant of this test is sketched below.
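
      • The 1472-byte payload corresponds to the standard 1500-byte MTU minus 28 bytes of headers (20-byte IP header + 8-byte ICMP header); on a jumbo-frame vMotion network the equivalent payload would be 8972 bytes.

      • When the cluster has several hosts, the same test can be repeated against each peer. The loop below is a minimal sketch (vmk1 and the addresses are placeholders) that prints only the statistics line for each target:

        # Substitute the vMotion IPs of the other hosts in the cluster
        for ip in 192.0.2.11 192.0.2.12 192.0.2.13; do
            echo "== $ip =="
            vmkping -I vmk1 -d -s 1472 -c 20 $ip | grep "packet loss"
        done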

 

Environment

VMware ESXi Version: 7.x
VMware ESXi Version: 8.x

VMware vSAN: 7.x
VMware vSAN: 8.x

Cause

A CRC (Cyclic Redundancy Check) error occurs when data corruption is detected during transmission over the physical network: the calculated checksum of the received data does not match the expected value, typically because of issues such as a faulty cable, SFP, Network Interface Card (NIC), or SAN switch.

  • CRC errors are noticed on the vMotion uplink network card.
    • Run the following command against the vmnic used for the vMotion uplink.

      • $ esxcli network nic stats get -n <vmnic#>
        NIC statistics for vmnic1
        Packets received: 5721419054
        Packets sent: 6897046642
        Bytes received: 2845905140057
        Bytes sent: 4960832527165
        Receive packets dropped: 0
        Transmit packets dropped: 0
        Multicast packets received: 133976174
        Broadcast packets received: 49838376
        Multicast packets sent: 390456
        Broadcast packets sent: 39197
        Total receive errors: 4028
        Receive length errors: 36
        Receive over errors: 0
        Receive CRC errors: 1996
        Receive frame errors: 0
    • CRC errors are captured under /var/run/log/hostd.log; a search sketch follows the log excerpt below.

      • DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > errorsRx: 178317
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxCRCErrors: 88158
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- >
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > errorsRx: 178313
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- > RxCRCErrors: 88156
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -- >
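
    • One way to check whether these entries are still being written (a sketch; adjust the filter as needed) is to search the log directly:

      • # Count CRC-related lines recorded so far
        grep -c "RxCRCErrors" /var/run/log/hostd.log
        # Show the most recent error-stats entries
        grep "Error stats for pnic" /var/run/log/hostd.log | tail -n 5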

  • The watch command refreshes the output at a regular interval; with the command below, you can monitor whether the CRC error count is increasing (an equivalent shell loop is sketched after it).

$ watch esxcli network nic stats get -n <vmnic#>
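
  • As an alternative to watch, a plain shell loop gives a similar rolling view of the CRC counters. This is a sketch; vmnic1 and the 10-second interval are examples:

    while true; do
        date
        esxcli network nic stats get -n vmnic1 | grep -i "CRC"
        sleep 10
    done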

Resolution

  • This issue lies outside the vSphere environment; involve the server hardware vendor and the SAN switch vendor to replace the defective hardware.

  • Note: The CRC error count will stop increasing once the defective hardware is replaced. A reboot of the ESXi server will reset the CRC error counters.

  • Once the issue has been resolved, rerun the vSAN Health Check tests to confirm that the MTU Check (ping with large packet size) warning is no longer present.

Additional Information