UDP Packet Drops or Fragmentation in Tanzu/Antrea Environments Due to MTU Mismatches

Article ID: 441548

Updated On:

Products

VMware Container Networking with Antrea

Issue/Introduction

Users might notice that UDP packets (specifically those around 1500 bytes) are dropping or failing to reach Pods. This usually surfaces when Pods try to communicate with external Application (AP) servers over UDP.

When looking into the environment, you'll likely spot an MTU mismatch: the underlying host/Node MTU is set to 1500 bytes, but the Antrea Pod interface MTU is set to 1450 bytes.

Environment

VMware Container Networking with Antrea

Cause

  • By default, Antrea uses Geneve overlay encapsulation for inter-node traffic. Encapsulation adds about 50 bytes to each packet over IPv4 (the outer IP and UDP headers, the Geneve header, and the inner Ethernet header). To carry full-size 1500-byte inner packets with headroom, the physical path should have an MTU of at least 1600 bytes. Our baseline recommendation is 1700 bytes, which leaves room for additional header extensions.

  • The antrea-agent looks at the Kubernetes Node's MTU and subtracts 50 to accommodate that tunnel header. So, if your host Node MTU is strictly 1500, the Pod gets an MTU of 1450.

  • Unlike TCP, UDP doesn't natively negotiate segment sizes (MSS) with a handshake. If an external server sends a full 1500-byte UDP packet to a Pod that can only handle 1450 bytes, the packet has to be fragmented.

  • If the application or the OS sets the "Don't Fragment" (DF) flag on packets sent from that UDP socket, fragmentation is blocked. This behavior is controlled by socket options that the application sets explicitly (see IP_MTU_DISCOVER in the Linux ip(7) man page) or by system-wide kernel parameters (such as net.ipv4.ip_no_pmtu_disc). When fragmentation is blocked, the oversized packet is dropped and the device that dropped it sends an ICMP "Fragmentation Needed" message back to the sender.
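
The arithmetic behind the points above can be sketched in a few shell functions. This is only an illustration: the function names are hypothetical, the 50-byte figure is the Geneve overhead quoted in this article, and the 28-byte figure assumes an IPv4 header without options plus the UDP header.

```shell
# Encapsulation overhead in bytes, as quoted in this article for Geneve.
GENEVE_OVERHEAD=50

# pod_mtu NODE_MTU: Pod MTU derived by subtracting the tunnel overhead.
pod_mtu() {
    echo $(( $1 - $GENEVE_OVERHEAD ))
}

# max_udp_payload MTU: largest UDP payload that fits in one unfragmented
# IPv4 packet (assumes a 20-byte IP header and an 8-byte UDP header).
max_udp_payload() {
    echo $(( $1 - 28 ))
}

# needs_fragmentation PACKET_SIZE MTU: succeeds (exit 0) when an IP packet
# of PACKET_SIZE bytes is larger than MTU and would have to be fragmented.
needs_fragmentation() {
    [ "$1" -gt "$2" ]
}
```

For example, `pod_mtu 1500` prints 1450, and `needs_fragmentation 1500 "$(pod_mtu 1500)"` succeeds, which is the case where a packet with DF set would be dropped.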

Resolution

To get those 1500-byte packets through without dropping, you'll need to increase the MTU across your physical and virtual network path.

Keep in mind: MTU changes apply globally to all nodes in the workload cluster; they cannot be configured on an individual, per-Pod basis.

Step 1: Increase Node and Fabric MTU

Bump the MTU on your physical switches, the NSX overlay, and the VKS cluster nodes to at least 1600 bytes (though 1700 bytes is highly recommended).

  • For users running AI, Storage, or Big Data workloads on Jumbo Frames (MTU 9000), simply matching 1500 is not enough. We recommend setting the Guest/Pod MTU to 8900. This maximizes throughput while leaving plenty of room for the encapsulation overhead.

  • If your environment uses a vSphere Distributed Switch (VDS), you cannot manually change the MTU directly on the vmk10 or vmk11 adapters. The change must be applied at the VDS level under Configure > Settings > Properties > Advanced.
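
As a quick sanity check while planning the change, the thresholds from Step 1 can be encoded in a small helper. The function name and the output labels are hypothetical; only the 1600 and 1700 byte thresholds come from this article.

```shell
# classify_path_mtu MTU: compare a planned path MTU with the guidance in
# Step 1 (1600 bytes minimum, 1700 bytes recommended).
classify_path_mtu() {
    if [ "$1" -ge 1700 ]; then
        echo "recommended"
    elif [ "$1" -ge 1600 ]; then
        echo "minimum"
    else
        echo "too-small"
    fi
}
```

Calling `classify_path_mtu 1500` prints "too-small", which matches the mismatch described in the Issue section.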

Step 2: Let Antrea Auto-Configure

Once the Node MTU is updated, you don't need to manually configure Antrea. The antrea-agent will automatically pick up the new Node MTU and scale up the Pod MTU globally. We highly recommend letting this auto-discovery do its job rather than forcing a manual override.

Step 3: Alternative Mitigation for Environments Where MTU Cannot Be Changed

If you cannot immediately increase the infrastructure MTU, consider these workarounds:

  • For UDP Traffic: Check your application architecture and kernel settings. Ensure the application does not explicitly set the "Don't Fragment" (DF) flag on the UDP socket, allowing the OS to fragment the 1500-byte packets to fit the 1450-byte MTU limit.

  • For TCP Traffic (MSS Clamping): North-South traffic mismatches often occur if intermediate physical links have smaller MTUs. Implement MSS Clamping for all container traffic exiting the cluster. MSS clamping rewrites the maximum segment size advertised during the TCP handshake so that the segments that follow fit within the smallest MTU on the network path, which prevents full-size segments from being dropped.
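
This article does not prescribe a specific clamping mechanism. As one common Linux approach, the sketch below computes an MSS for a given MTU and prints (without applying) an iptables rule that uses the TCPMSS target. The function names are hypothetical, the 40-byte figure assumes IPv4 and TCP headers without options, and where such a rule belongs in your environment (a gateway, an edge device, or a node) is an assumption to validate first.

```shell
# mss_for_mtu MTU: TCP MSS that fits in MTU, assuming a 20-byte IPv4 header
# and a 20-byte TCP header with no options.
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# clamp_rule MTU: print an iptables command that rewrites the MSS option on
# forwarded TCP SYN packets. The command is only printed; running it requires
# root on the host that forwards the traffic.
clamp_rule() {
    echo "iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $(mss_for_mtu "$1")"
}
```

For a 1450-byte Pod MTU, `clamp_rule 1450` prints a rule with `--set-mss 1410`. The TCPMSS target also accepts `--clamp-mss-to-pmtu`, which derives the value from the path MTU instead of a fixed number.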

Step 4: Verify the Changes and Test the Path

To isolate whether the drop is still happening on the host or along the path, run the following diagnostic checks:

  • From inside a Pod/Node (Test Path MTU): Run a ping with the "Do Not Fragment" flag to confirm the exact drop point.

    ping -c 4 -M do -s 1472 [Destination_IP]
    

    (Note: 1472-byte payload + 8-byte ICMP header + 20-byte IPv4 header = 1500 bytes total)

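  • From the ESXi Host (Test TEP Path directly): Ensure you include the vxlan netstack parameter to test the actual overlay path. The 1572-byte payload plus 28 bytes of ICMP and IP headers produces a 1600-byte packet, so this confirms the 1600-byte minimum from Step 1.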

    vmkping ++netstack=vxlan -I vmk10 -s 1572 -d [Remote_TEP_IP]
    
  • From inside a Pod container (Check local interface limits):

    ip link show eth0
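
The payload sizes used in the checks above (1472 and 1572) follow from subtracting 28 bytes of IPv4 and ICMP headers from the MTU being tested. The sketch below captures that calculation and adds a binary search for the largest payload that passes with DF set. All function names are hypothetical, and `ping_probe` assumes the Linux iputils `ping` used earlier in this step.

```shell
# ping_payload MTU: ICMP payload size that yields a packet of exactly MTU
# bytes (20-byte IPv4 header + 8-byte ICMP header).
ping_payload() {
    echo $(( $1 - 28 ))
}

# ping_probe SIZE DEST: succeeds if one DF-flagged ping of SIZE payload
# bytes reaches DEST (assumes Linux iputils ping).
ping_probe() {
    ping -c 1 -W 1 -M do -s "$1" "$2" >/dev/null 2>&1
}

# find_max_payload LOW HIGH PROBE DEST: binary search for the largest
# payload in [LOW, HIGH] for which "PROBE size DEST" succeeds.
# Prints 0 if no size in the range passes.
find_max_payload() {
    _lo=$1
    _hi=$2
    _best=0
    while [ "$_lo" -le "$_hi" ]; do
        _mid=$(( ($_lo + $_hi) / 2 ))
        if "$3" "$_mid" "$4"; then
            _best=$_mid
            _lo=$(( $_mid + 1 ))
        else
            _hi=$(( $_mid - 1 ))
        fi
    done
    echo "$_best"
}
```

A possible invocation is `find_max_payload 1200 8972 ping_probe [Destination_IP]`; adding 28 to the printed value gives the effective path MTU. Each probe waits up to one second, so the search takes a few seconds on a lossy path.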