vMotion Migration Fails with Timeout Due to MTU Size Mismatch

Article ID: 417889


Products

VMware vSphere ESXi

Issue/Introduction

This article addresses a common issue where a VMware vMotion operation fails with a timeout error, directly attributed to a Maximum Transmission Unit (MTU) size mismatch across the network path used for vMotion traffic. This typically occurs when Jumbo Frames are configured inconsistently, leading to packet drops and severe performance degradation during the high-volume data transfer required for vMotion.

Symptoms include:

  • The vMotion operation fails with a generic network error, or the error specifically mentions a timeout.

  • Error messages in vCenter or ESXi host logs (e.g., /var/log/vmkernel.log, /var/log/hostd.log) may indicate "connection timed out," "lost connection to peer," or issues with network packets during migration. 
  • vmkping tests with large packet sizes (jumbo frames) fail to reach the destination ESXi host's vMotion VMkernel adapter.
  • Increased vMotion duration, with the operation eventually timing out.

Environment

  • vCenter Server 8.0
  • ESXi 7.0 hosts being migrated to a new environment

Cause

vMotion is a network-intensive operation, and for performance reasons, many environments configure Jumbo Frames (an MTU of 9000 bytes) on their vMotion network.

The core problem arises when Jumbo Frames are enabled on some components in the vMotion path but not on all of them, leading to an MTU mismatch. 

When an MTU Mismatch Occurs: If a network device (ESXi VMkernel adapter, vSwitch, physical switch, or router) in the vMotion path has an MTU smaller than the packets being sent (e.g., a 1500 MTU device in a 9000 MTU path), the packet is either fragmented or, more commonly, dropped.

In protocols like TCP/IP with the "Don't Fragment" (DF) bit set (which is often the case for large data transfers like vMotion), packets exceeding the MTU of an intermediate device are dropped without an ICMP "Fragmentation Needed" message being returned. 

Excessive packet drops lead to retransmissions, which overwhelm the network, consume processing power, and ultimately cause the vMotion operation to exceed its timeout threshold.
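To put numbers on the header overhead: a 9000-byte MTU leaves room for an 8972-byte ICMP payload plus a 20-byte IP header and an 8-byte ICMP header. The commands below are an illustrative sketch of how such a black hole appears in practice; vmk1 and 192.168.20.12 are placeholders for your vMotion VMkernel adapter and the peer host's vMotion IP.

    # 8972-byte payload + 20-byte IP header + 8-byte ICMP header = 9000 bytes; with the DF bit
    # set, this probe is silently dropped if any hop in the path only supports a 1500-byte MTU.
    vmkping -d -s 8972 -I vmk1 192.168.20.12
    # 1472 + 28 = 1500, so this probe still succeeds through the same 1500-byte bottleneck.
    vmkping -d -s 1472 -I vmk1 192.168.20.12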

Resolution

Resolving this issue requires identifying the MTU mismatch point and ensuring consistent MTU settings across the entire vMotion network path.

Identify MTU Configuration Discrepancies

CRITICAL: Perform these checks on BOTH the source and destination ESXi hosts and on all intermediate network devices.

  1. Check vMotion VMkernel Adapter MTU:
    1. vCenter UI: Navigate to Host > Configure > VMkernel adapters. Select the vMotion VMkernel adapter (e.g., vmk1 or vmk-vMotion), click Edit, and check the MTU size.
    2. ESXi CLI (SSH to host): Run esxcli network ip interface list and look for the MTU value reported for the vMotion VMkernel adapter (e.g., MTU: 9000). A combined CLI check is sketched after this list.

  2. Check vSwitch MTU (for the vMotion Port Group):
    1. vDS (vCenter UI): Navigate to Networking > [Your Distributed Switch]. Go to Configure > Properties and check the MTU value for the vDS. This MTU applies to all uplinks and port groups on the vDS.
    2. vSS (ESXi CLI): Run esxcli network vswitch standard list | grep -A 5 "Name: <vSwitch_name>" and look for the MTU value of the standard switch carrying vMotion traffic.

  3. Check Physical Network Switch Ports:
    1. CLI of physical switch: Log in to the physical switch(es) connected to the ESXi host uplinks used for vMotion.
    2. Check the configuration of the relevant physical interfaces (access ports, trunk ports, port-channels).
    3. Look for MTU settings (mtu 9000 or equivalent).
    4. Check if a global Jumbo Frame setting is enabled and configured correctly.

  4. Check Routers (if L3 vMotion):
    1. If vMotion is occurring across subnets (Layer 3), ensure all routers in the path have their interfaces configured to support the desired MTU (e.g., 9000).
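For steps 1 and 2, the host-side values can be gathered in one pass from the ESXi Shell. This is a minimal sketch assuming SSH access to each host; the adapter and switch names mentioned in the comments (vmk1, vSwitch1) are placeholders for your environment:

    # List all VMkernel adapters; note the MTU reported for the vMotion adapter (e.g., vmk1)
    esxcli network ip interface list
    # List standard vSwitches and their MTU (if vMotion runs over a vSS, e.g., vSwitch1)
    esxcli network vswitch standard list
    # List distributed switches visible to this host and their MTU (if vMotion runs over a vDS)
    esxcli network vswitch dvs vmware list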



Perform MTU Path Discovery (vmkping)

This step helps pinpoint the exact MTU size that the path can support.

  1. SSH to one of the ESXi hosts (e.g., source host).

  2. Run vmkping with the "Don't Fragment" bit and varying packet sizes.
    1. Start with a large packet size (e.g., 8972 bytes) and the vMotion VMkernel adapter: vmkping -d -s 8972 -I <vmotion_vmk_name> <destination_vmotion_ip> (replace <vmotion_vmk_name> with your vMotion VMkernel adapter name, e.g., vmk1, and <destination_vmotion_ip> with the IP of the vMotion VMkernel adapter on the target host).
    2. If this fails: Gradually reduce the packet size (e.g., to 8000, then 7000, and finally 1472, the largest payload that fits a standard 1500-byte MTU after headers) until the vmkping succeeds. A loop that automates this is sketched after this list.
    3. The largest packet size that succeeds + 28 bytes (for IP/ICMP headers) is the effective MTU of the path.
    4. Example for standard MTU: vmkping -d -s 1472 -I <vmotion_vmk_name> <destination_vmotion_ip> (if this works, the path's effective MTU is 1500).

  3. Repeat from the destination ESXi host to the source ESXi host. This confirms bi-directional path MTU.
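Rather than running each probe by hand, the candidate sizes can be tested in a small loop from the ESXi Shell. This is a minimal sketch, assuming vmk1 is the vMotion VMkernel adapter and 192.168.20.12 is the destination host's vMotion IP; substitute your own values:

    # Probe decreasing payload sizes with the DF bit set; the largest size that succeeds,
    # plus 28 bytes of IP/ICMP headers, is the effective MTU of the path.
    for size in 8972 8000 7000 4000 1472; do
        echo "Testing payload size ${size} bytes"
        vmkping -d -c 2 -s ${size} -I vmk1 192.168.20.12 && echo "Path carries ${size} + 28 bytes"
    done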


Ensure Consistent MTU

WARNING: Modifying MTU settings can disrupt network connectivity if not applied consistently across the entire path. Perform these changes during a maintenance window or with caution.

  1. Decide on a Consistent MTU:

    1. Enable Jumbo Frames (MTU 9000) end-to-end: The simplest and often recommended approach. This requires all components in the vMotion path to support and be configured for MTU 9000.
    2. Revert to Standard MTU (1500) end-to-end: Use this if Jumbo Frames are not strictly necessary or are problematic to implement consistently.

  2. Apply Consistent MTU Settings (on all components in the path). A CLI sketch for the ESXi-side changes follows this list.

    1. VMkernel Adapter MTU:
      1. vCenter UI: Edit the vMotion VMkernel adapter and set the MTU to the desired consistent value (e.g., 9000 or 1500).
      2. ESXi CLI: esxcli network ip interface set -i <vmotion_vmk_name> -m <desired_mtu>
    2. vSwitch MTU:
      1. vDS (vCenter UI): Navigate to Networking > [Your Distributed Switch]. Go to Configure > Properties and set the MTU to the desired value.
      2. vSS (ESXi CLI): esxcli network vswitch standard set -v <vSwitch_name> -m <desired_mtu>
    3. Physical Network Switch Ports/Global MTU: Configure all relevant switch interfaces and global MTU settings to match the desired MTU.
    4. Routers (if L3): Configure all router interfaces in the vMotion path to the desired MTU.
    5. Save/Apply Changes: Ensure all changes are saved and applied on all devices. A reboot of ESXi hosts is usually not required for MTU changes, but physical switch changes might require saving the configuration and sometimes a module restart, depending on the vendor.
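The ESXi-side changes referenced above can be applied and verified from the CLI. This is a minimal sketch for a host whose vMotion traffic uses a standard vSwitch, with vSwitch1 and vmk1 as placeholder names; a vDS MTU must instead be changed once in the vCenter UI under Configure > Properties:

    # Raise the vSwitch MTU first so the switch can carry jumbo frames
    esxcli network vswitch standard set -v vSwitch1 -m 9000
    # Then raise the vMotion VMkernel adapter MTU to match
    esxcli network ip interface set -i vmk1 -m 9000
    # Verify that the new MTU values are in effect
    esxcli network vswitch standard list
    esxcli network ip interface list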



  3. Re-run vmkping Tests:

    1. Confirm that vmkping -d -s <desired_mtu - 28> -I <vmotion_vmk_name> <destination_vmotion_ip> now succeeds in both directions.

  4. Attempt a Test vMotion:

    1. Initiate a vMotion of a non-critical VM between the affected ESXi hosts.
    2. Monitor the vMotion task to ensure it completes successfully and within the expected timeframe.


Note: The following best practices help prevent similar vMotion issues:

  • Dedicated vMotion Network: Always use a dedicated VMkernel port, VLAN, and IP subnet for vMotion traffic to isolate it and simplify troubleshooting.
  • High Bandwidth: Provide sufficient network bandwidth (10 Gbps or higher) for vMotion.
  • NTP Synchronization: Ensure all ESXi hosts and vCenter Server are synchronized with an accurate Network Time Protocol (NTP) source. Time discrepancies can lead to vMotion failures.
  • Redundancy: Implement NIC teaming for vMotion-enabled VMkernel ports for redundancy and increased bandwidth.


Additional Information

vMotion Network Requirements

Troubleshooting vMotion issues

Troubleshooting vMotion fails with network errors