Error on vSAN Skyline Health check - vMotion: MTU check (ping with large packet size)


Article ID: 394719


Products

VMware vSAN

Issue/Introduction

  • The following error is reported in Skyline Health.

    vCenter login GUI > Cluster > Monitor > Skyline health

    vMotion: MTU check (ping with large packet size)

Error skyline health:

  • vMotion tasks take longer than expected or fail with one of the errors below:

Error 195887167. Connection closed by remote host, possibly due to timeout.

The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available.

vMotion migration [########:########] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout
  • Packet drops are observed when running vmkping on the vMotion network with a packet size of 8900 bytes.

    • Log in to the ESXi host as root and run the command below:

      vmkping -I <vmotion_vmk> <vMotion_IP_of_another_host> -d -s 8900 -c 30

      -I specifies the VMkernel adapter interface
      -d sets the Don't Fragment bit
      -s sets the ICMP payload size in bytes
      -c sets the number of packets to send

Example: vmkping -I vmk1 ###.###.#.### -d -s 8900 -c 100

PING ###.###.#.### (###.###.#.###): 8900 data bytes
8900 bytes from ###.###.#.###: icmp_seq=0 ttl=64 time=1.517 ms
8900 bytes from ###.###.#.###: icmp_seq=2 ttl=64 time=0.508 ms
8900 bytes from ###.###.#.###: icmp_seq=3 ttl=64 time=0.634 ms
--- ###.###.#.### ping statistics ---
10 packets transmitted, 10 packets received, 75% packet loss
round-trip min/avg/max = 0.487/0.685/1.517 ms
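
When the test is repeated across several host pairs, it can help to pull the loss figure out of the summary line instead of reading it by eye. The following is a minimal sketch, assuming the summary line has the "N% packet loss" form shown above; `packet_loss_percent` is a hypothetical helper name, not part of ESXi.

```shell
#!/bin/sh
# Hypothetical helper: read ping-style output on stdin and print the
# packet-loss percentage found in the summary line.
packet_loss_percent() {
    # Summary line form: "N packets transmitted, M packets received, P% packet loss"
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Example on an ESXi host (interface and address are placeholders):
#   vmkping -I <vmotion_vmk> <vMotion_IP_of_another_host> -d -s 8900 -c 30 | packet_loss_percent
```

Any value above 0 on the vMotion network with the Don't Fragment bit set is worth investigating along the path.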

Refer to the KB article "Troubleshooting vMotion network connectivity issues" to identify the vMotion network IP configuration.

  • CRC errors are found on the vmnic used as the vSAN and vMotion uplink.

    • Log in to the ESXi host as root, run esxtop, and press n to open the network view. Note which vmnic the vSAN and vMotion VMkernel interfaces use.
    • Run the following command against the corresponding vSAN or vMotion vmnic.

$ esxcli network nic stats get -n <vmnic#>
NIC statistics for vmnic1
Packets received: 5721419054
Packets sent: 6897046642
Bytes received: 2845905140057
Bytes sent: 4960832527165
Receive packets dropped: 0
Transmit packets dropped: 0
Multicast packets received: 133976174
Broadcast packets received: 49838376
Multicast packets sent: 390456
Broadcast packets sent: 39197
Total receive errors: 4028
Receive length errors: 36
Receive over errors: 0
Receive CRC errors: 1996
Receive frame errors: 0
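
These counters are cumulative, so a single reading says less than whether the value is still growing. The sketch below assumes the "Receive CRC errors: <number>" line format shown above; `rx_crc_errors` and `crc_delta` are hypothetical helper names used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "esxcli network nic stats get" output on stdin
# and print only the "Receive CRC errors" counter.
rx_crc_errors() {
    awk -F': *' '/Receive CRC errors/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Hypothetical helper: print how much the counter grew between two samples.
# $1 is the earlier reading, $2 is the later reading.
crc_delta() {
    echo $(( $2 - $1 ))
}

# Example on an ESXi host (vmnic name is a placeholder):
#   before=$(esxcli network nic stats get -n <vmnic#> | rx_crc_errors)
#   sleep 60
#   after=$(esxcli network nic stats get -n <vmnic#> | rx_crc_errors)
#   crc_delta "$before" "$after"
```

A non-zero delta suggests the corruption is ongoing rather than historical.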

  • CRC errors are also logged in /var/run/log/hostd.log:

        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> errorsRx: 178317
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> RxCRCErrors: 88158
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -->
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102792]: [Originator@6876 sub=Statssvc.StatsCollector] Error stats for pnic: vmnic#
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> errorsRx: 178313
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> RxLengthErrors: 2001
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: --> RxCRCErrors: 88156
        DateTxx:xx:xx.xxxx Wa(164) Hostd[2102782]: -->
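
To review how often these entries appear, the RxCRCErrors values can be extracted from the log in one pass. This is a minimal sketch, assuming the entries contain the literal text "RxCRCErrors: <number>" as in the excerpt above; `hostd_crc_values` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print every RxCRCErrors value recorded in a log file.
# $1 is the path to the log, for example /var/run/log/hostd.log
hostd_crc_values() {
    grep 'RxCRCErrors:' "$1" | sed 's/.*RxCRCErrors: *//'
}

# Example on an ESXi host:
#   hostd_crc_values /var/run/log/hostd.log
```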

Environment

VMware ESXi Version: 7.x
VMware ESXi Version: 8.x

VMware vSAN : 7.x
VMware vSAN : 8.x

Cause

  • The MTU check warning is caused by an MTU mismatch between the vSphere environment and the physical switch.

  • The Maximum Transmission Unit (MTU) health check, also called "MTU check (ping with large packet size)", complements the basic connectivity check for vMotion traffic.

  • When the VMkernel adapters have an MTU of 9000 across the ESXi hosts and the physical switch is configured with 1500, large packets are dropped at the physical switch when the source does not fragment them.

  • The CRC (Cyclic Redundancy Check) error occurs when data corruption is detected during transmission over a physical network.

  • It happens when the checksum calculated for the received data does not match the expected value, which indicates possible corruption caused by a faulty cable, SFP transceiver, network interface card (NIC), or physical switch.
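
The MTU mismatch case can be illustrated with a little arithmetic. The following is a minimal sketch, assuming IPv4 without IP options (a 20-byte IP header plus an 8-byte ICMP header); `max_icmp_payload` is a hypothetical helper name.

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet for a given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
max_icmp_payload() {
    echo $(( $1 - 20 - 8 ))
}

max_icmp_payload 9000   # prints 8972
max_icmp_payload 1500   # prints 1472
```

An 8900-byte test payload fits within an MTU of 9000 but is far larger than the 1472 bytes that an MTU of 1500 allows, so a hop configured with 1500 cannot forward it when fragmentation is not permitted.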

Resolution

  • The Maximum Transmission Unit (MTU) size must be configured uniformly across the ESXi hosts' VMkernel adapters, the virtual switch, the physical NICs (vmnics), and the entire path through the physical switch infrastructure. This prevents performance-impacting packet fragmentation and keeps communication reliable, especially when jumbo frames are in use.

    • Run the following commands to confirm the MTU size and the ping response on the corresponding VMkernel adapters.

To validate the MTU on the physical NICs :- esxcfg-nics -l
To validate the MTU on the VMkernel adapters :- esxcfg-vmknic -l
To validate the MTU on the virtual switch :- esxcfg-vswitch -l
To validate the ping response across the VMkernel adapters with the appropriate packet size (MTU minus 28 bytes of headers, for example 8972 for an MTU of 9000) :- vmkping -I <vmotion_vmk OR vSAN_vmk> <vMotion or vSAN_IP_of_another_host> -d -s <packet size> -c 30
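
Once the MTU values have been read from the command output, a quick comparison can confirm that the host-side settings agree. This is a minimal sketch that only compares numbers passed in by hand; `mtus_match` is a hypothetical helper name, and it does not query the physical switch.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every MTU value passed as an argument is identical.
mtus_match() {
    if [ "$#" -eq 0 ]; then
        echo "usage: mtus_match <mtu> [<mtu> ...]"
        return 2
    fi
    first=$1
    for mtu in "$@"; do
        if [ "$mtu" != "$first" ]; then
            echo "MTU mismatch: $mtu differs from $first"
            return 1
        fi
    done
    echo "All MTU values are $first"
}

# Example with values read manually from the three listings (placeholders):
#   mtus_match <vmnic MTU> <vSwitch MTU> <vmk MTU>
```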

  • Since the Cyclic Redundancy Check (CRC) issue occurs outside the vSphere environment (i.e., at the physical layer), involve the Server Hardware Vendor and the Physical Switch Vendor to identify the issue.

  • Once the issue has been resolved, rerun the vSAN Health Check tests to confirm that the MTU Check (ping with large packet size) warning is no longer present.

Additional Information