All VMs in an "invalid state" due to vSAN network partition

Article ID: 395490

Updated On:

Products

VMware vSAN

Issue/Introduction

  • The vSAN cluster is in a network-partitioned state, and the VMs running on the vSAN datastore show as invalid in the host UI because they no longer have access to their backing objects.
  • When a host is accessed directly from the vSphere web client, the VMs show as invalid there as well.
  • Using SSH, log in to each ESXi host as root.
  • Running the command esxcli vsan cluster get shows Sub-Cluster Member Count: 1 on every host:

[root@ESXI1:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2025-03-11T15:56:02Z
   Local Node UUID: 67d03e4d-81fa-2bbf-74b6-############
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 67d03e4d-81fa-2bbf-74b6-############
   Sub-Cluster Backup UUID:
   Sub-Cluster UUID: 526b3c96-a22f-5025-d80d-############
   Sub-Cluster Membership Entry Revision: 0
   Sub-Cluster Member Count: 1
   Sub-Cluster Member UUIDs: 67d03e4d-81fa-2bbf-74b6-############
   Sub-Cluster Member HostNames: ESXI1
   Sub-Cluster Membership UUID: 805ad067-1018-5eff-bf26-############
   Unicast Mode Enabled: false
   Maintenance Mode State: OFF
   Config Generation: None 0 0.0
   Mode: REGULAR
   vSAN ESA Enabled: true

  • Skyline Health in the vSphere Client reports a cluster partition error.
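
The member-count check above can be scripted when there are many hosts to inspect. The following is a minimal sketch, not an official tool: the function names member_count and check_partition and the EXPECTED_HOSTS variable are hypothetical, and the expected host count must be supplied by the operator. It only reads the "Sub-Cluster Member Count" line shown in the output above.

```shell
#!/bin/sh
# Sketch only: member_count, check_partition and EXPECTED_HOSTS are
# illustrative names, not part of esxcli.

member_count() {
    # stdin: output of `esxcli vsan cluster get`; prints the member count.
    awk -F': *' '/Sub-Cluster Member Count/ { sub(/[ \t\r]+$/, "", $2); print $2; exit }'
}

check_partition() {
    # $1 = number of hosts the operator expects in the cluster
    # stdin = output of `esxcli vsan cluster get`
    expected="$1"
    count="$(member_count)"
    if [ -z "$count" ]; then
        echo "UNKNOWN: member count not found"
        return 2
    fi
    if [ "$count" -lt "$expected" ]; then
        echo "PARTITIONED: host sees $count of $expected members"
        return 1
    fi
    echo "OK: host sees $count of $expected members"
}

# Runs only on a host where esxcli exists and EXPECTED_HOSTS is set.
if command -v esxcli >/dev/null 2>&1 && [ -n "${EXPECTED_HOSTS:-}" ]; then
    esxcli vsan cluster get | check_partition "$EXPECTED_HOSTS"
fi
```

For example, running it with EXPECTED_HOSTS=4 on a host that returns the output above would report the host as partitioned, because it sees only 1 member.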

Environment

VMware vSAN (all versions)

Cause

When hosts cannot pass unfragmented packets at the MTU configured on the vSAN network to the rest of the vSAN cluster, the cluster becomes network partitioned. Review the KB on how to test VMkernel network connectivity with the vmkping command to verify that the MTU is configured correctly.

In the screenshot, the tests using sizes of 8872 and 8850 show 100% packet loss, while the test using a size of 8825 gets ping responses with no packet loss.
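
A test like the one described can be run with vmkping using -d (do not fragment) and -s (ICMP payload size). For IPv4, the largest payload that fits in one packet is the MTU minus 28 bytes (20 bytes of IP header plus 8 bytes of ICMP header), so an MTU of 9000 corresponds to a size of 8972 and an MTU of 1500 to 1472. The sketch below assumes IPv4; payload_for_mtu, VSAN_VMK and PEER_IP are hypothetical names, and PEER_IP stands for the vSAN IP address of another host in the cluster.

```shell
#!/bin/sh
# Sketch only: payload_for_mtu, VSAN_VMK and PEER_IP are illustrative names.

payload_for_mtu() {
    # IPv4 header (20 bytes) + ICMP header (8 bytes) = 28 bytes of overhead.
    echo $(( $1 - 28 ))
}

# Runs only where vmkping exists and a peer vSAN IP address was supplied.
if command -v vmkping >/dev/null 2>&1 && [ -n "${PEER_IP:-}" ]; then
    for mtu in 9000 1500; do
        size="$(payload_for_mtu "$mtu")"
        echo "Testing MTU $mtu with an unfragmented payload of $size bytes"
        # -I: source VMkernel interface, -d: do not fragment,
        # -s: payload size, -c: number of pings
        vmkping -I "${VSAN_VMK:-vmk1}" -d -s "$size" -c 3 "$PEER_IP"
    done
fi
```

Read the packet-loss figure in each vmkping summary: 100% loss at the larger size with no loss at the smaller size suggests that the path between the hosts does not carry the larger MTU.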

Resolution

For the vSAN cluster to form and handle leader changes without partitioning, the hosts must be able to pass an unfragmented packet at the configured MTU. There are two options to correct this issue.

  1. Work with your networking team to identify the cause of the MTU mismatch and correct it on the physical network.
  2. Change the MTU size on the ESXi hosts so that unfragmented packets can pass on the existing network configuration. For example, if an MTU of 9000 does not pass but 1500 does, change the MTU of the ESXi hosts to 1500. Because vCenter is in an invalid state, this must be done from the CLI.

1. Identify the network interface that vSAN uses by running esxcli vsan network list:

[root@server name:~] esxcli vsan network list
Interface
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: ########-####-####-####-############
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan

2. Use the esxcli network ip interface set command to change the MTU:

The general syntax is: esxcli network ip interface set --interface-name=[vmkernel_interface_name] --mtu=[new_MTU] 

Example: esxcli network ip interface set --interface-name=vmk1 --mtu=1500

This command will set the MTU for the network interface named vmk1 to 1500. 
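
The two steps above can be combined and the result verified on each host. This is a minimal sketch under stated assumptions: vsan_vmknics, vmk_mtu and NEW_MTU are hypothetical names, and the verification assumes that esxcli network ip interface list prints a "Name:" line and an "MTU:" line for each interface.

```shell
#!/bin/sh
# Sketch only: vsan_vmknics, vmk_mtu and NEW_MTU are illustrative names.

vsan_vmknics() {
    # stdin: output of `esxcli vsan network list`; prints each VmkNic name.
    awk -F': *' '/VmkNic Name/ { sub(/[ \t\r]+$/, "", $2); print $2 }'
}

vmk_mtu() {
    # $1 = interface name; stdin: output of `esxcli network ip interface list`.
    # Assumes each interface section has "Name: <vmk>" and "MTU: <n>" lines.
    awk -v want="$1" '
        $1 == "Name:" { name = $2 }
        $1 == "MTU:" && name == want { print $2; exit }
    '
}

# Runs only on a host where esxcli exists and NEW_MTU is set (for example 1500).
if command -v esxcli >/dev/null 2>&1 && [ -n "${NEW_MTU:-}" ]; then
    for vmk in $(esxcli vsan network list | vsan_vmknics); do
        esxcli network ip interface set --interface-name="$vmk" --mtu="$NEW_MTU"
        echo "$vmk MTU is now: $(esxcli network ip interface list | vmk_mtu "$vmk")"
    done
fi
```

Once every host has been changed, running esxcli vsan cluster get again should show whether the Sub-Cluster Member Count has risen above 1.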

 

Reference KBs

vSAN Health Service - Network Health - vSAN Cluster Partition