vSAN Health Service - Network Health - Hosts small ping test (connectivity check) and Hosts large ping test (MTU check)

Article ID: 326823



Products

VMware vSAN

Issue/Introduction

This article explains the Network Health - Hosts small ping test (connectivity check) and the Network Health - Hosts large ping test (MTU check) checks in the vSAN Health Service, and describes why they might report an error.

Environment

VMware vSAN 6.0.x
VMware vSAN 7.0.x

Resolution

Q: What do the Network Health - Hosts small ping test (connectivity check) and Hosts large ping test (MTU check) checks do?

While most other network related vSAN health checks assess various aspects of the network configuration, this health check takes a more active approach. As vSAN is not able to check the configuration of the physical network, one way to ensure that IP connectivity exists among all ESXi hosts in the vSAN cluster is to simply ping each ESXi host on the vSAN network from each other ESXi host.

The Hosts small ping test (connectivity check) health check automates the pinging of each ESXi host from each of the other ESXi hosts in the vSAN cluster, and ensures that there is connectivity between all the ESXi hosts on the vSAN network. In this test, all nodes ping all other nodes in the cluster.

The Hosts large ping test (MTU check) health check complements the basic ping connectivity check. The MTU (Maximum Transmission Unit) size is often increased to improve network performance. Incorrectly configured MTUs frequently do not show up as a vSAN network partition, but instead cause performance issues or I/O errors in individual objects. They can also lead to virtual machine deployment failures on vSAN. For the stability of vSAN clusters, it is critically important that the large ping test succeeds.

While the basic check uses small packets, the large ping test uses large packets (9000 bytes), often referred to as jumbo frames. Assuming the small ping test succeeds, the large ping test should also succeed when the MTU size is consistently configured across all VMkernel adapters (vmknics), virtual switches, and physical switches.

Note: If the source vmknic has an MTU of 1500, it will fragment the 9000 byte packet, and then those fragments will travel perfectly fine along the network to the other ESXi host where they are reassembled. As long as all network devices along the path use a higher or equal MTU, then this test passes.

The test can fail when the vmknic has an MTU of 9000 and the physical switch enforces an MTU of 1500. In that case the source does not fragment the packet, and the physical switch drops it.

However, if the vmknic has an MTU of 1500 and the physical switch has an MTU of 9000 (for example, because iSCSI traffic using 9000 also runs over that switch), there is no issue and the test passes.

vSAN works with an MTU of either 1500 or 9000, as long as the value is consistently configured across the cluster.
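
The rule in the note above can be written down as a small sketch. This is a simplified model of the reasoning in this article, not how the health check itself is implemented, and the function name large_ping_passes is hypothetical. It assumes the source vmknic sends frames no larger than its own MTU, so the test passes when every device on the path accepts frames at least that large.

```shell
# Simplified model (an assumption for illustration, not vSAN's implementation).
# large_ping_passes SRC_VMKNIC_MTU PATH_MIN_MTU
# Succeeds (exit 0) when the smallest MTU on the path is at least
# as large as the MTU of the source vmknic.
large_ping_passes() {
    src_mtu=$1
    path_min_mtu=$2
    [ "$path_min_mtu" -ge "$src_mtu" ]
}

# The three cases discussed in the note:
large_ping_passes 1500 9000 && echo "vmknic 1500, switch 9000: pass"
large_ping_passes 9000 1500 || echo "vmknic 9000, switch 1500: fail"
large_ping_passes 9000 9000 && echo "vmknic 9000, switch 9000: pass"
```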

Q: What does it mean when it is in an error state?

If the small ping test fails, it indicates that the network is misconfigured. The test sends 3 pings, and the check is considered failed if even one ping is lost. This could be caused by many factors, and the issue may be in the virtual network (vmknic, virtual switch) or the physical network (cable, physical NIC, physical switch). Examine the other network health check results to narrow down the root cause of the misconfiguration. If all the other health checks indicate a good ESXi side configuration, the issue may reside in the physical network.

This ping test is performed using very small packets, so it verifies only basic connectivity.
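
A manual approximation of the "3 pings, any loss is a failure" rule is sketched below for a single source host. It is not the health check's own code. The names no_packet_loss, check_peers and PING_CMD are hypothetical, and the sketch assumes the ping command prints a summary line of the common form "N% packet loss".

```shell
# PING_CMD is a hypothetical override so the loop can be dry-run off-host.
PING_CMD=${PING_CMD:-vmkping}

# no_packet_loss: reads ping output on stdin and succeeds only if the
# summary line reports exactly 0% packet loss (assumed output format).
no_packet_loss() {
    grep -Eq '(^|[^0-9.])0(\.0+)?% packet loss'
}

# check_peers VMK PEER...
# Sends 3 small pings to each peer through the given vmknic and
# returns non-zero if any peer shows packet loss.
check_peers() {
    vmk=$1
    shift
    rc=0
    for peer in "$@"; do
        if "$PING_CMD" -I "$vmk" -c 3 "$peer" 2>&1 | no_packet_loss; then
            echo "OK   $peer"
        else
            echo "FAIL $peer"
            rc=1
        fi
    done
    return "$rc"
}
```

On an ESXi host this could be called as check_peers vmk# followed by the vSAN IP addresses of the other hosts. Repeating it from every host approximates the all-to-all test.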

If the large ping test fails, it means that there is an MTU misconfiguration somewhere in the vSAN network. The source of the misconfiguration will need to be traced. It could be the VMkernel adapters, the virtual switches, or the physical network switches.
  1. Make sure the MTU is consistently configured across the cluster.
  2. If the default MTU of 1500 has not been changed on the data nodes or on the witness appliance, the error means that the 9000 byte test packet could not be delivered. With an MTU of 1500, a failure indicates that something in the network path sets the Don't Fragment (DF) flag. Applications are free to send packets of any size over the network, and it is the responsibility of the network to deliver those packets. Normally DF is not set: a packet larger than the MTU is split into fragments of MTU size or smaller, and those fragments are reassembled at the remote end. When DF is set, a packet larger than the MTU cannot be fragmented and therefore cannot go through. In this case, clear the Don't Fragment flag everywhere along the network path. If clearing DF is not an option, contact VMware Support for further evaluation before silencing the health check.
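
To trace where the MTU differs, it can help to collect the configured values and compare them. The helpers below are a minimal sketch with hypothetical names (extract_mtus, mtus_consistent). They assume the esxcli listing prints one "MTU: <value>" line per entry, as "esxcli network ip interface list" and "esxcli network vswitch standard list" typically do.

```shell
# extract_mtus: reads esxcli listing output on stdin and prints one MTU
# value per line (assumes lines of the form "   MTU: 1500").
extract_mtus() {
    awk -F': *' '/^ *MTU:/ {print $2}'
}

# mtus_consistent MTU...
# Succeeds only when at least one value is given and all values are equal.
mtus_consistent() {
    [ "$#" -gt 0 ] || return 2
    first=$1
    for mtu in "$@"; do
        if [ "$mtu" != "$first" ]; then
            echo "mismatch: $mtu != $first"
            return 1
        fi
    done
    echo "all MTUs are $first"
}
```

On an ESXi host, for example, mtus_consistent $(esxcli network ip interface list | extract_mtus) compares all VMkernel adapters, including any that do not carry vSAN traffic. Physical switch MTUs still have to be checked with the switch vendor's tools.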

Q: How does one troubleshoot and fix the error state?

1. Identify the VMkernel port (vmknic) being used by vSAN.

esxcli vsan network list
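
If only the adapter name is needed, the command output can be filtered. The helper below is a sketch with a hypothetical name (vsan_vmknics), and it assumes the output contains a "VmkNic Name: vmkN" line for each vSAN interface.

```shell
# vsan_vmknics: reads "esxcli vsan network list" output on stdin and
# prints the name of each vSAN vmknic (assumed "VmkNic Name:" field).
vsan_vmknics() {
    awk -F': *' '/VmkNic Name:/ {print $2}'
}

# On an ESXi host:
#   esxcli vsan network list | vsan_vmknics
```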



2. Perform a small packet ping test
Ping another vSAN node in the cluster using the vmknic found in step one.

vmkping -I vmk# <vSAN Node>



3. Perform a large packet ping test

vmkping -I vmk# -s 8972 <vSAN Node>

Note: If the MTU in use for vSAN traffic is 1500 and the test fails, something in the network path has the Don't Fragment flag set. Clear the Don't Fragment flag everywhere along the network path. If clearing DF is not an option, collect support bundles from vCenter, the ESXi hosts, and NSX if applicable, and then contact VMware Support for further evaluation before silencing the health check.

4. If using jumbo frames, repeat the test with the do-not-fragment "-d" switch. Otherwise, skip this step.

vmkping -I vmk# -d -s 8972 <vSAN Node>

Note: The -d switch sets the do-not-fragment option on the vmkping command. Without it, the packet may be fragmented along the way, so a successful ping does not prove that jumbo frames work end to end.


If the ping with the -d switch fails, jumbo frames are either not enabled or incorrectly configured somewhere along the path. Jumbo frames need to be enabled end to end.
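
The payload size of 8972 used in the vmkping commands follows from the MTU: a 9000 byte IPv4 packet carries a 20 byte IP header (assuming no IP options) and an 8 byte ICMP header, which leaves 8972 bytes for the payload. The helper name icmp_payload_for_mtu below is hypothetical and only illustrates the arithmetic.

```shell
# icmp_payload_for_mtu MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

icmp_payload_for_mtu 9000   # prints 8972
icmp_payload_for_mtu 1500   # prints 1472
```

By the same arithmetic, a payload of 1472 bytes with the -d switch is the largest that fits in a single packet when the MTU is 1500.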


Additional Information

For more information on collecting VMware vSAN logs, see Collecting vSAN support logs and uploading to VMware (2072796).

Also, see:

VMware vSAN Design Guide
vSAN Health Service - Cluster Health - vSAN Health Service up-to-date
vSAN Health Service - Cluster Health - Advanced vSAN configuration in sync
vSAN Health Service - Network Health - Hosts disconnected from vCenter Server
vSAN Health Service - Network Health - Unexpected vSAN cluster members
vSAN Health Service - Network Health - vSAN Cluster Partition
vSAN Health Service - Network Health – Hosts with vSAN disabled
vSAN Health Service - Network Health - All hosts have a vSAN vmknic configured
vSAN Health Service - Network Health - All hosts have matching subnets
vSAN Health Service - Network Health - Hosts with connectivity issues
vSAN Health Service - Data Health – vSAN Object Health
vSAN Health Service - Physical Disk Health - Metadata Health
vSAN Health Service - Physical Disk Health - Overall Disk Health
vSAN Health Service - Limits Health – Current Cluster Situation
vSAN Health Service - Limits Health – After one additional host failure
vSAN Health Service - Physical Disk Health - Disk Capacity
vSAN Health Service – Physical Disk Health – Software State Health
vSAN Health Service – Physical Disk Health – Component Metadata Health
vSAN Health Service - Physical Disk Health – Congestion
vSAN Health Service - Physical Disk Health – Memory pools
vSAN Health Service - vSAN HCL Health - Controller Release Support
vSAN Health Service – vSAN HCL Health – Controller Driver
vSAN Health Service - vSAN HCL Health – vSAN HCL DB up-to-date
vSAN Health Service - vSAN HCL Health – SCSI Controller on vSAN HCL
vSAN Health Service - Cluster Health – CLOMD liveness check
vSAN Health Service - Cluster Health - vSAN Health service installation
vSAN Health Check Information
vSAN Health Service - Network Health - Active Multicast connectivity check
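Virtual SAN Health Service - Network Health - Hosts small ping test (connectivity check) and Hosts large ping test (MTU check) (Simplified Chinese version)
Virtual SAN Health Service - Network Health - Hosts small ping test (connectivity check) and Hosts large ping test (MTU check) (Japanese version)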