All VMs in an "invalid state" due to vSAN network partition

Products

VMware vSAN

Issue/Introduction

The vSAN cluster is in a network-partitioned state, and the VMs running on the vSAN datastore show as invalid in the host UI because they no longer have access to their backing objects.
When a host is accessed directly from the vSphere Web Client, the VMs show as invalid as well.

Using SSH, log in to the ESXi hosts as root.

When the command esxcli vsan cluster get is run, every host shows Sub-Cluster Member Count: 1

[root@ESXI1:~] esxcli vsan cluster get
Cluster Information
Enabled: true
Current Local Time: 2025-03-11T15:56:02Z
Local Node UUID: 67d03e4d-81fa-2bbf-74b6-############
Local Node Type: NORMAL
Local Node State: MASTER
Local Node Health State: HEALTHY
Sub-Cluster Master UUID: 67d03e4d-81fa-2bbf-74b6-############
Sub-Cluster Backup UUID:
Sub-Cluster UUID: 526b3c96-a22f-5025-d80d-############
Sub-Cluster Membership Entry Revision: 0
Sub-Cluster Member Count: 1
Sub-Cluster Member UUIDs: 67d03e4d-81fa-2bbf-74b6-############
Sub-Cluster Member HostNames: ESXI1
Sub-Cluster Membership UUID: 805ad067-1018-5eff-bf26-############
Unicast Mode Enabled: false
Maintenance Mode State: OFF
Config Generation: None 0 0.0
Mode: REGULAR
vSAN ESA Enabled: true
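
The check above can be repeated quickly on each host with a small helper. This is a minimal sketch, assuming a POSIX shell with awk is available on the host and that the output keeps the "Sub-Cluster Member Count" label shown above; the function names member_count and is_partitioned are hypothetical and not part of esxcli.

```shell
# Print the "Sub-Cluster Member Count" value from
# `esxcli vsan cluster get` output supplied on stdin.
member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ { print $2 }'
}

# Succeed (exit 0) when the host sees fewer members than expected.
# $1 = number of hosts that should be in the vSAN cluster (assumption: you know this).
is_partitioned() {
    expected="$1"
    actual=$(member_count)
    [ -n "$actual" ] && [ "$actual" -lt "$expected" ]
}

# Hypothetical usage on an ESXi host in a 4-node cluster:
#   esxcli vsan cluster get | is_partitioned 4 && echo "host is partitioned"
```

A member count of 1 on every host, as in the output above, means each host has formed its own single-node partition.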

Skyline Health in the vSphere Client reports a cluster partition error.

Environment

VMware vSAN (all versions)

Cause

When hosts cannot pass non-fragmented packets at the MTU configured on the vSAN network to the rest of the vSAN cluster, the cluster becomes network partitioned, as shown in the screenshot below. Review the KB on how to test VMkernel network connectivity with the vmkping command to verify that the MTU is configured correctly.

In the screenshot, the first line shows 100% packet loss using MTU sizes 8872 and 8850. With an MTU size of 8825, the ping response has no packet loss.
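
The same kind of test can be sketched from the ESXi shell. When vmkping is run with -d (do not fragment) and -s (ICMP payload size), the payload that exactly fills a given MTU is the MTU minus 28 bytes (20-byte IPv4 header plus 8-byte ICMP header). The helper name icmp_payload_for_mtu is hypothetical, and vmk1 and the target address are placeholders for your own vSAN interface and a peer host's vSAN IP.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet
# for the given MTU: MTU - 20 (IPv4 header) - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Hypothetical usage on an ESXi host (placeholders: vmk1, <peer vSAN IP>):
#   vmkping -I vmk1 -d -s "$(icmp_payload_for_mtu 9000)" <peer vSAN IP>
#   vmkping -I vmk1 -d -s "$(icmp_payload_for_mtu 1500)" <peer vSAN IP>
```

If the 9000-byte test fails while the 1500-byte test succeeds, some device in the path is not passing jumbo frames end to end.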

Resolution

For the vSAN cluster to form and handle leader changes without partitioning, the hosts must be able to pass a non-fragmented packet at the configured MTU. There are two options to correct this issue:

Work with your networking team to identify the cause of the MTU mismatch and correct it on the physical network.
Change the MTU on the ESXi hosts so that non-fragmented packets can pass on the existing network configuration. For example, if an MTU of 9000 does not pass but 1500 does, change the MTU of the ESXi hosts to 1500. Because vCenter is in an invalid state, this must be done via the CLI.

1. Identify the network interface that vSAN uses with esxcli vsan network list:

[root@server name:~] esxcli vsan network list
Interface
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: ########-####-####-####-############
Agent Group Multicast Address: 224.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 224.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Data-in-Transit Encryption Key Exchange Port: 0
Multicast TTL: 5
Traffic Type: vsan
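
If this has to be repeated on several hosts, the interface name can be pulled out of the output rather than read by eye. This is a minimal sketch, assuming awk is available in the ESXi shell and that the "VmkNic Name" label appears as shown above; vsan_vmknics is a hypothetical helper name.

```shell
# Print every VMkernel interface name found in
# `esxcli vsan network list` output supplied on stdin.
vsan_vmknics() {
    awk -F': *' '/VmkNic Name/ { print $2 }'
}

# Hypothetical usage on an ESXi host:
#   esxcli vsan network list | vsan_vmknics
```

A host with more than one vSAN-tagged interface prints one name per line, and each of those interfaces needs the same MTU check.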

2. Use the esxcli network ip interface set command to change the MTU:

The general syntax is: esxcli network ip interface set --interface-name=[vmkernel_interface_name] --mtu=[new_MTU]

Example: esxcli network ip interface set --interface-name=vmk1 --mtu=1500

This command will set the MTU for the network interface named vmk1 to 1500.
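
After the change, the new value can be confirmed with esxcli network ip interface list. The helper below is a sketch only: mtu_of is a hypothetical name, and it assumes the command prints one "Name:" and one "MTU:" field per interface.

```shell
# Print the MTU of the named VMkernel interface from
# `esxcli network ip interface list` output supplied on stdin.
# $1 = interface name, for example vmk1.
mtu_of() {
    awk -F': *' -v nic="$1" '
        $1 ~ /^[ \t]*Name$/ { current = $2 }                 # remember which interface this block describes
        $1 ~ /^[ \t]*MTU$/ && current == nic { print $2 }    # report MTU only for the requested one
    '
}

# Hypothetical usage on an ESXi host:
#   esxcli network ip interface list | mtu_of vmk1
```

Repeat the change and the check on every host in the cluster, then run esxcli vsan cluster get again to confirm that the Sub-Cluster Member Count matches the number of hosts.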

Reference KBs

vSAN Health Service - Network Health - vSAN Cluster Partition