Q: What does the Network Health - Hosts small ping test (connectivity check) check and Hosts large ping test (MTU check) check do?
While most other network related vSAN health checks assess various aspects of the network configuration, this health check takes a more active approach. As vSAN is not able to check the configuration of the physical network, one way to ensure that IP connectivity exists among all ESXi hosts in the vSAN cluster is to simply ping each ESXi host on the vSAN network from each other ESXi host.
The Hosts small ping test (connectivity check) health check automates the pinging of each ESXi host from each of the other ESXi hosts in the vSAN cluster, and ensures that there is connectivity between all the ESXi hosts on the vSAN network. In this test, all nodes ping all other nodes in the cluster.
The Hosts large ping test (MTU check) health check complements the basic ping connectivity check. The MTU (Maximum Transmission Unit) is often increased to improve network performance. Incorrectly configured MTUs frequently do not show up as a vSAN network partition, but instead cause performance issues or I/O errors in individual objects. They can also lead to virtual machine deployment failures on vSAN. For the stability of vSAN clusters, it is critically important for the large ping test check to succeed.
While the basic check used small packets, the large packet check uses large packets (9000 bytes). These are often referred to as jumbo frames. Assuming the small ping test succeeds, the large ping test should also succeed when the MTU size is consistently configured across all VMkernel adapters (vmknics), virtual switches and any physical switches.
Note: If the source vmknic has an MTU of 1500, it will fragment the 9000 byte packet, and then those fragments will travel perfectly fine along the network to the other ESXi host where they are reassembled. As long as all network devices along the path use a higher or equal MTU, then this test passes.
A failure can occur when the vmknic has an MTU of 9000 and the physical switch enforces an MTU of 1500. In this case the source does not fragment the packet, and the physical switch drops it.
However, if there is an MTU of 1500 on the vmknic and an MTU of 9000 on the physical switch (for example, because iSCSI traffic using an MTU of 9000 is also running), then there is no issue and the test passes.
vSAN does not care whether the MTU is set to 1500 or 9000, as long as it is consistently configured across the cluster.
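The pass/fail rule described above can be sketched as a small shell function. This is only an illustrative model, not how the health check itself is implemented: the function names `large_ping_passes` and `describe_path` are hypothetical, and the model assumes the only relevant inputs are the source vmknic MTU and the MTU of each device along the path.

```shell
#!/bin/sh
# Hypothetical helper: predict the large ping test result from MTU values.
# Usage: large_ping_passes <vmknic_mtu> <hop_mtu> [<hop_mtu> ...]
# Returns 0 (pass) when every device on the path has an MTU greater than
# or equal to the source vmknic MTU, and 1 (fail) otherwise.
large_ping_passes() {
    src_mtu=$1
    shift
    for hop_mtu in "$@"; do
        if [ "$hop_mtu" -lt "$src_mtu" ]; then
            return 1
        fi
    done
    return 0
}

# Hypothetical helper: print PASS or FAIL for a described path.
describe_path() {
    if large_ping_passes "$@"; then
        echo "PASS"
    else
        echo "FAIL"
    fi
}

describe_path 1500 1500 9000   # vmknic 1500, virtual switch 1500, physical switch 9000
describe_path 9000 9000 1500   # vmknic 9000, virtual switch 9000, physical switch 1500
```

The first call prints PASS and the second prints FAIL, matching the two scenarios described above.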
Q: What does it mean when it is in an error state?
If the small ping tests fail, it indicates that the network is misconfigured. The test sends 3 pings. If one ping is lost, the check considers this a failure. This could be caused by many factors, and the issue may be in the virtual network (vmknic, virtual switch) or the physical network (cable, physical NIC, physical switch). The other network health check results should be examined to narrow down the root cause of the misconfiguration. If all the other health checks indicate a good ESXi side configuration, the issue may reside in the physical network.
This ping test is performed using very small packets, so it ensures basic connectivity.
If the large ping test fails, it means that there is an MTU misconfiguration somewhere in the vSAN network. The source of the misconfiguration will need to be traced. It could be the VMkernel adapters, the virtual switches, or the physical network switches.
- Make sure the MTU is consistently configured across the cluster.
- If the default MTU of 1500 has not been changed on the data nodes or on the witness appliance, the error means that the failing test sent a 9000 byte packet over the network. If the MTU is 1500 and the test fails, something in the network has the Don't Fragment (DF) flag set. Applications are free to send packets of any size over the network, and it is the responsibility of the network to deliver those packets. Normally DF is not set. If DF is not set and an application sends a packet that is larger than the MTU, the packet is fragmented into two or more packets of MTU size or smaller, and those fragments are reassembled on the remote end. If DF is set, a packet that is larger than the MTU cannot be fragmented and cannot go through. In such cases, it is recommended to clear the Don't Fragment flag everywhere along the network path. If clearing DF is not an option, reach out to VMware Support for further evaluation before silencing the health check.
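To make the fragmentation behavior concrete, the following sketch estimates how many fragments a packet is split into when DF is not set. The function name `fragment_count` is hypothetical, and the arithmetic assumes IPv4 with a 20-byte header and no IP options; fragment payloads other than the last must be a multiple of 8 bytes.

```shell
#!/bin/sh
# Hypothetical helper: number of IPv4 fragments for a packet when DF is not set.
# Usage: fragment_count <packet_bytes> <mtu>
# packet_bytes is the total IPv4 packet size, including the 20-byte header.
fragment_count() {
    packet=$1
    mtu=$2
    ip_header=20   # assumption: IPv4 header without options

    # A packet that already fits in the MTU is sent as-is.
    if [ "$packet" -le "$mtu" ]; then
        echo 1
        return
    fi

    payload=$((packet - ip_header))
    # Each fragment carries its own IP header; the payload of every
    # fragment except the last is rounded down to a multiple of 8 bytes.
    per_frag=$(( (mtu - ip_header) / 8 * 8 ))
    # Ceiling division: fragments needed to carry the whole payload.
    echo $(( (payload + per_frag - 1) / per_frag ))
}

fragment_count 9000 1500   # prints 7
fragment_count 9000 9000   # prints 1
```

With an MTU of 1500, each fragment carries up to 1480 bytes of payload, so the 8980 bytes of payload in a 9000 byte packet need seven fragments.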
Q: How does one troubleshoot and fix the error state?
1. Identify the VMkernel port (vmknic) being used by vSAN.
esxcli vsan network list
2. Perform a small packet ping test.
Ping another vSAN node in the cluster using the vmknic found in step one.
vmkping -I vmk# <vSAN Node>
3. Perform a large packet ping test.
vmkping -I vmk1 -s 8972 <vSAN Node>
Note: If the MTU in use for vSAN traffic is 1500 and the test fails, something in the network has the Don't Fragment flag set. It is recommended to clear the Don't Fragment flag everywhere along the network path. If clearing DF is not an option, collect support bundles from vCenter, the ESXi hosts, and NSX (if applicable), and then reach out to VMware Support for further evaluation before silencing the health check.
4. If using jumbo frames, test with the do-not-fragment "-d" switch; otherwise this step can be skipped.
vmkping -I vmk1 -d -s 8972 <vSAN Node>
Note: The -d switch sets the do-not-fragment option on the vmkping command. If this option is not used, the packet will be fragmented and the test will not provide valid results.
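The 8972 value used with -s in the steps above is the ICMP payload size that produces a 9000 byte IP packet: 9000 minus a 20-byte IPv4 header and an 8-byte ICMP header. The sketch below derives that value for any MTU and prints the matching vmkping command without running it. The function names `icmp_payload` and `mtu_ping_cmd` are hypothetical, the address 192.0.2.10 is a placeholder for a vSAN node, and the arithmetic assumes IPv4 with no IP options.

```shell
#!/bin/sh
# Hypothetical helper: largest ICMP echo payload that fits in one IPv4 packet.
# Usage: icmp_payload <mtu>
icmp_payload() {
    # 20-byte IPv4 header (no options assumed) + 8-byte ICMP header
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical helper: print (do not run) the do-not-fragment vmkping command.
# Usage: mtu_ping_cmd <vmknic> <mtu> <target>
mtu_ping_cmd() {
    echo "vmkping -I $1 -d -s $(icmp_payload "$2") $3"
}

icmp_payload 9000                    # prints 8972
icmp_payload 1500                    # prints 1472
mtu_ping_cmd vmk1 9000 192.0.2.10    # prints: vmkping -I vmk1 -d -s 8972 192.0.2.10
```

For a vSAN network that uses the default MTU of 1500, the equivalent do-not-fragment payload size under the same assumptions would be 1472.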
If you see the following, either jumbo frames are not enabled or they are incorrectly configured. Jumbo frames need to be enabled end to end.