Q: What does the Network Health – VSAN Cluster Partition check do?
To function properly, all vSAN hosts must be able to communicate with each other over the vSAN network.
If some ESXi hosts in the cluster cannot communicate with the others, the vSAN cluster splits into multiple network partitions, that is, sub-groups of ESXi hosts that can talk to each other but not to other sub-groups.
When this occurs, vSAN objects may become unavailable until the network misconfiguration is resolved. For smooth operation of production vSAN clusters, it is very important to have a stable network with no extra network partitions (that is, only one partition).
This health check examines the cluster to see how many partitions exist. It displays an error if there is more than a single partition in the vSAN cluster. Note that this check only determines whether there is a network issue; it does not attempt to find a root cause. Other network health checks are required to find the root cause.
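The idea of a partition can be illustrated with a minimal Python sketch. It is not how the health service is implemented; it only assumes a list of host names and a list of host pairs that can reach each other (both hypothetical inputs), groups the hosts into sets that can talk to each other, and reports an error when more than one set exists.

```python
from collections import defaultdict


def find_partitions(hosts, links):
    """Group hosts into partitions: sets of hosts that can reach each other.

    hosts: iterable of host names (hypothetical input).
    links: iterable of (host_a, host_b) pairs that can communicate.
    """
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    partitions = []
    seen = set()
    for host in hosts:
        if host in seen:
            continue
        # Walk everything reachable from this host (depth-first search).
        group = set()
        stack = [host]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        partitions.append(group)
    return partitions


def partition_check(hosts, links):
    """Return 'OK' for exactly one partition, otherwise 'ERROR'."""
    return "OK" if len(find_partitions(hosts, links)) == 1 else "ERROR"


if __name__ == "__main__":
    hosts = ["esxi-01", "esxi-02", "esxi-03", "esxi-04"]
    # esxi-01/02 can talk to each other, as can esxi-03/04, but not across.
    links = [("esxi-01", "esxi-02"), ("esxi-03", "esxi-04")]
    print(find_partitions(hosts, links))
    print(partition_check(hosts, links))
```

As in the health check itself, the result only says that a partition exists; it says nothing about why the hosts cannot reach each other.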
Q: What does it mean when it is in an error state?
This health check reports OK when only a single partition is found. As soon as multiple partitions are discovered, the cluster is considered unhealthy.
There are likely to be other warnings displayed in the vSphere Web Client when a multiple partition issue occurs. For example, the network configuration status in the vSAN General view is likely to state network misconfiguration detected.
Another useful view is vSAN Disk Management. It contains a column that shows the network partition group to which each ESXi host belongs. To see how many partitions the cluster has been split into, examine this column. If each ESXi host is in its own network partition group, then there is a cluster-wide issue. If only one ESXi host is in its own network partition group and all other ESXi hosts are in a different network partition group, then only that ESXi host has the issue. This may help to isolate the issue at hand and focus the investigation effort.
Note: The health user interface displays the same information in the details section of this check.
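The reasoning applied to the network partition group column can be sketched in a few lines of Python. This is an illustration only: the host names and group labels are hypothetical values copied by hand from the view, not the output of any vSAN API.

```python
from collections import Counter


def classify_partitions(host_groups):
    """Interpret a mapping of host name -> network partition group label."""
    sizes = Counter(host_groups.values())
    if len(sizes) <= 1:
        return "healthy: single partition"
    if all(size == 1 for size in sizes.values()):
        # Every host sits alone in its own group.
        return "cluster-wide issue: every host is isolated"
    isolated = sorted(h for h, g in host_groups.items() if sizes[g] == 1)
    if len(sizes) == 2 and len(isolated) == 1:
        # One host alone, all other hosts together in one group.
        return f"single-host issue: {isolated[0]}"
    return "multiple partitions: investigate each group"


if __name__ == "__main__":
    # Hypothetical values as they might be read from the column.
    column = {
        "esxi-01": "Group 1",
        "esxi-02": "Group 1",
        "esxi-03": "Group 1",
        "esxi-04": "Group 2",
    }
    print(classify_partitions(column))
```

Any pattern other than the two described above falls through to a generic result, since the grouping alone is not enough to narrow it down further.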
Q: How does one troubleshoot and fix the error state?
The network configuration issue needs to be located and resolved. Additional health service checks on the network are designed to assist you in finding the root cause of the network partition. The causes can range from misconfigured subnets (all ESXi hosts must have matching subnets), misconfigured vSAN traffic VMkernel adapters (all ESXi hosts must have a vSAN vmknic configured), misconfigured VLANs, or general network communication issues, to specific multicast issues (all ESXi hosts must have matching multicast settings). The additional network health checks are designed to isolate which of those issues may be the root cause, and should be viewed in parallel with this health check. If the current environment is a stretched cluster, refer to the vSAN Stretched Cluster Configuration Guide to see if any additional static routes are required.
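As an illustration of the matching-subnets requirement, the following Python sketch flags hosts whose vSAN vmknic address falls outside the subnet used by most of the cluster. It assumes the addresses have been collected by hand in "IP/prefix" form (the host names and addresses shown are hypothetical) and that all hosts are meant to share one subnet; a routed or stretched cluster that relies on static routes would need a different comparison.

```python
import ipaddress
from collections import Counter


def hosts_off_subnet(vmknics):
    """Return hosts whose vSAN vmknic subnet differs from the most common one.

    vmknics: mapping of host name -> 'ip/prefix' string (hypothetical input).
    """
    if not vmknics:
        return []
    # ip_interface() keeps the host address; .network gives its subnet.
    networks = {
        host: ipaddress.ip_interface(addr).network
        for host, addr in vmknics.items()
    }
    majority = Counter(networks.values()).most_common(1)[0][0]
    return sorted(host for host, net in networks.items() if net != majority)


if __name__ == "__main__":
    vmknics = {
        "esxi-01": "192.168.10.11/24",
        "esxi-02": "192.168.10.12/24",
        "esxi-03": "192.168.10.13/24",
        "esxi-04": "192.168.20.14/24",  # different subnet
    }
    print(hosts_off_subnet(vmknics))
```

A mismatched netmask is caught as well, because the comparison is made on the whole network (address and prefix length) rather than on the address alone.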
Aside from misconfigurations, partitions can also occur when the network is overloaded and drops a substantial number of packets. vSAN can tolerate a small number of dropped packets, but once packet loss rises above a moderate level, performance issues may ensue.
If none of the misconfiguration checks indicate an issue, it is advisable to watch the dropped packet counters and to run a proactive network performance test. Proactive network performance tests, which may be initiated from RVC, are discussed in the vSAN Health Services Guide.
To examine the dropped packet counters on an ESXi host, use the esxtop network view (press n) and examine the %DRPRX field for excessive dropped packets. You may also need to watch the switch and switch ports, as they may also drop packets. Also check for an excessive number of pause frames, which can slow down the network and impact performance.