vSAN -- Cluster down. VMs are showing as "invalid". Network partition caused by an invalid/incomplete unicast agent list on vSAN host(s)

Article ID: 317830

Updated On:

Products

VMware vSAN

Issue/Introduction

One or more of the following Symptoms apply:

 

  • VMs are showing as "invalid" in the Host Client:

 
 

If the Web Client is not available, log in to one of the vSAN hosts via SSH (for example, with PuTTY) and run the following command:

esxcli vsan health cluster list
 
The output lists the health test "vSAN cluster partition" with the status "red".
 
Example:
 

Note: Use the -w switch to show the short names of the tests. A short name can then be passed to esxcli vsan health cluster get -t clusterpartition to see which hosts are partitioned. The full name in quotes also works: esxcli vsan health cluster get -t "vSAN cluster partition"
 
 
( In particular the Chapters "MTU Check via vmkping" and "List of Ports and Protocols required for vSAN" )

 

  • After adding a new host to the vSAN cluster, the alert "Host cannot communicate with one or more other nodes in the vSAN enabled cluster" is shown on the Summary tab of all vSAN hosts:


 

Environment

VMware vSAN (All Versions)

Cause

One or more vSAN hosts are missing from the cluster because the Unicast Agent list on one or more vSAN hosts is incorrect or incomplete.

The following verification uses a 4-node cluster as an example:

 

1.) The output of the command esxcli vsan cluster get confirms that one vSAN host is missing from the cluster (Sub-Cluster Member Count is 3 instead of 4):

[root@######:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-03-30T13:40:44Z
   Local Node UUID: 602583eb-233c-b69a-8291-############
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 602583eb-233c-b69a-8291-############
   Sub-Cluster Backup UUID: 602572bd-2ef4-8f69-d8ce-############
   Sub-Cluster UUID: 52cd69c8-e409-363f-bd75-############
   Sub-Cluster Membership Entry Revision: 4
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 602572bd-2ef4-8f69-d8ce-############, 602583eb-233c-b69a-8291-############, 60198995-b367-2922-############
   Sub-Cluster Member HostNames: esxi-02, esxi-04, esxi-03
   Sub-Cluster Membership UUID: f4266360-e165-0b0b-############
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: 81f5b3c2-fe55-4a00-9eb5-############ 20 2021-03-30T13:26:12.0

 

2.) The Unicastagent list on the vSAN host shows only 2 entries instead of the expected 3:

[root@######:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint                                              SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  -----------------------------------------------------------  --------------
602572bd-2ef4-8f69-d8ce-############          0              true  ###.###.10.13  12321              
60198995-b367-2922-8fbf-############          0              true  ###.###.10.14  12321       

 
Clarification based on the example above:
 
The example shows that the Unicastagent list is incomplete.
A complete and correct Unicastagent list on each host in a vSAN cluster contains an entry for every host in the cluster except the host itself. In a 4-node cluster, each host therefore lists 3 entries.
( The Unicastagent list can also contain wrong or outdated entries, or be completely empty. )
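
The two checks above can also be scripted. The following is a minimal sketch, not an official tool: the function names member_count and unicast_entry_count are hypothetical, and the parsing assumes the output layouts shown in the example above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; parsing assumes the
# output layouts shown in the examples above.

# Print the "Sub-Cluster Member Count" value from
# `esxcli vsan cluster get` output read on stdin.
member_count() {
    awk '/Sub-Cluster Member Count:/ { print $NF; exit }'
}

# Count the data rows in `esxcli vsan cluster unicastagent list` output
# read on stdin (skips the header, the dashed rule, and blank lines).
unicast_entry_count() {
    awk 'NF > 0 && $1 != "NodeUuid" && $1 !~ /^-+$/ { n++ } END { print n + 0 }'
}

# Possible usage on a host (expected_hosts is the intended cluster size):
#   expected_hosts=4
#   [ "$(esxcli vsan cluster get | member_count)" -eq "$expected_hosts" ] \
#       || echo "host(s) missing from the cluster"
#   [ "$(esxcli vsan cluster unicastagent list | unicast_entry_count)" -eq $((expected_hosts - 1)) ] \
#       || echo "Unicastagent list incomplete"
```

Both helpers only read text from stdin, so they can be tried against saved command output before being run on a host.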

 

One possible cause of an incorrect or incomplete Unicastagent list is that the advanced setting IgnoreClusterMemberListUpdates is set to 1 on one or more ESXi hosts in the cluster.

A value of 1 tells the host to ignore any Unicastagent list updates coming from vCenter.

A value of 0, which is the default setting, tells the host to accept those updates from vCenter.

To check the current setting, run the following command:
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
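
To interpret the result in a script, the value can be pulled out of the command output. This is a minimal sketch under one assumption: that esxcfg-advcfg -g prints a line of the form "Value of <option> is <N>". The function names advcfg_value and ignore_updates_state are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.
# Assumption: `esxcfg-advcfg -g` prints "Value of <option> is <N>".

# Print the value (last field) from `esxcfg-advcfg -g` output on stdin.
advcfg_value() {
    awk '/^Value of / { print $NF; exit }'
}

# Translate the value into the behavior described above.
ignore_updates_state() {
    case "$(advcfg_value)" in
        0) echo "accepting updates from vCenter (default)" ;;
        1) echo "ignoring updates from vCenter" ;;
        *) echo "unknown" ;;
    esac
}

# Possible usage on a host:
#   esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates | ignore_updates_state
```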

 

Resolution

  1. Before making any changes to the Unicastagent list via the ESXi CLI, set IgnoreClusterMemberListUpdates to 1 on all hosts so that updates from vCenter do not overwrite the manual changes:
    To set the value to "1": esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

     The following steps are based on the 4-node cluster example outlined above:

  2. Check the Unicastagent list on all vSAN Hosts to determine which is incomplete:

    [root@esxi-01:~] esxcli vsan cluster unicastagent list
    NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint SubClusterUuid                           
    ------------------------------------  ---------  ----------------  -------------  -----  ----------  ----------------------------------------
    602572bd-2ef4-8f69-d8ce-############          0              true  ###.###.10.13  12321              
    60198995-b367-2922-8fbf-############          0              true  ###.###.10.14  12321              
    602583eb-233c-b69a-8291-############          0              true  ###.###.10.12  12321              


    [root@esxi-02:~] esxcli vsan cluster unicastagent list
    NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint SubClusterUuid                           
    ------------------------------------  ---------  ----------------  -------------  -----  ----------  ----------------------------------------

    60257046-5d95-a750-7135-############          0              true  ###.###.10.11  12321              
    60198995-b367-2922-8fbf-############          0              true  ###.###.10.14  12321              
    602583eb-233c-b69a-8291-############          0              true  ###.###.10.12  12321              

    [root@esxi-03:~] esxcli vsan cluster unicastagent list
    NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint SubClusterUuid                           
    ------------------------------------  ---------  ----------------  -------------  -----  ----------  ----------------------------------------

    602572bd-2ef4-8f69-d8ce-############          0              true  ###.###.10.13  12321              
    60257046-5d95-a750-7135-############          0              true  ###.###.10.11  12321              
    602583eb-233c-b69a-8291-############          0              true  ###.###.10.12  12321              

    [root@esxi-04:~] esxcli vsan cluster unicastagent list
    NodeUuid                              IsWitness  Supports Unicast  IP Address      Port  Iface Name  Cert Thumbprint SubClusterUuid                           
    ------------------------------------  ---------  ----------------  -------------  -----  ----------  ----------------------------------------

    602572bd-2ef4-8f69-d8ce-############          0              true  ###.###.10.13  12321              
    60198995-b367-2922-8fbf-############          0              true  ###.###.10.14  12321              

    ---> esxi-04 is missing 1 host (esxi-01)

  3. Find the UUID and vSAN IP of the missing/invalid Host:
    Go to the missing Host and get the UUID:
    [root@esxi-01:~] cmmds-tool whoami
    60257046-5d95-a750-7135-############
    Go to the missing Host and get the vSAN vmk IP address:

    [root@esxi-01:~] esxcfg-vmknic -l
    Interface  Port Group/DVPort/Opaque Network        IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type                NetStack 
    -------------------output truncated------------------------
    vmk2       vmotion                                 IPv6      fe80::250:####:####:####                64                              00:50:##:##:##:## 1500    65535     true    STATIC, PREFERRED   defaultTcpipStack
    vmk3       vsan                                    IPv4      ###.###.10.11                           255.255.255.0   ###.###.10.255  00:50:##:##:##:## 1500    65535     true    STATIC              defaultTcpipStack
    vmk3       vsan                                    IPv6      fe80::250:####:####:####                64                              00:50:##:##:##:## 1500    65535     true    STATIC, PREFERRED   defaultTcpipStack

     
    --> vmk3 is used for vSAN.
    Get the thumbprint of the missing/invalid Host:
    [root@esxi-01:~] openssl x509 -in /etc/vmware/ssl/rui.crt -fingerprint -sha1 -noout
    sha1 Fingerprint=##:##:##:##

  4. Add the missing entry to the Unicastagent list on every host where it is missing (in this example, it is missing only on host esxi-04):
    Syntax:
    esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321 -T <Host Cert Thumbprint>

    [root@esxi-04:~] esxcli vsan cluster unicastagent add -t node -u 60257046-5d95-a750-7135-############ -U true -a ###.###.10.11 -p 12321 -T ##:##:##:##:##
  5. Verify that the Cluster is complete:
    [root@esxi-04:~] esxcli vsan cluster get
    Cluster Information
       Enabled: true
       Current Local Time: 2021-03-30T14:21:55Z
       Local Node UUID: 602583eb-233c-b69a-8291-############
       Local Node Type: NORMAL
       Local Node State: AGENT
       Local Node Health State: HEALTHY
       Sub-Cluster Master UUID: 60257046-5d95-a750-7135-############
       Sub-Cluster Backup UUID: 60198995-b367-2922-8fbf-############
       Sub-Cluster UUID: 52cd69c8-e409-363f-bd75-############
       Sub-Cluster Membership Entry Revision: 5
       Sub-Cluster Member Count: 4
       Sub-Cluster Member UUIDs: 60257046-5d95-a750-7135-############, 60198995-b367-2922-8fbf-############, 602572bd-2ef4-8f69-d8ce-############, 602583eb-233c-b69a-8291-############
       Sub-Cluster Member HostNames: esxi-01, esxi-03, esxi-02, esxi-04
       Sub-Cluster Membership UUID: f5a46060-4df7-160b-4fdc-############
       Unicast Mode Enabled: true
       Maintenance Mode State: OFF
       Config Generation: 81f5b3c2-fe55-4a00-9eb5-############ 20 2021-03-30T14:21:15.294

  6. Once all changes have been made to the Unicastagent list on all impacted hosts, set IgnoreClusterMemberListUpdates back to its default value of 0.
    To set the value to "0": esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
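
Steps 2 to 4 can be partly scripted. The following is a minimal sketch under stated assumptions, not a supported tool: the function names (unicast_ips, missing_peers, thumbprint, add_command) are hypothetical, the parsing assumes the column layout shown in step 2 (IP address in the fourth column), and add_command only prints the command from step 4 for review instead of running it.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical. Nothing here changes
# host configuration; add_command only prints a command for review.

# List the IP addresses found in `esxcli vsan cluster unicastagent list`
# output read on stdin (4th column of each data row, as in step 2).
unicast_ips() {
    awk 'NF > 0 && $1 != "NodeUuid" && $1 !~ /^-+$/ { print $4 }'
}

# missing_peers OWN_IP ALL_IPS...   (unicast list output on stdin)
# Print every expected peer IP that has no entry in the list.
missing_peers() {
    own=$1; shift
    present=$(unicast_ips)
    for ip in "$@"; do
        [ "$ip" = "$own" ] && continue
        printf '%s\n' "$present" | grep -qxF "$ip" || printf '%s\n' "$ip"
    done
}

# Strip the "... Fingerprint=" prefix from the openssl output on stdin.
thumbprint() {
    sed -n 's/^.*Fingerprint=//p'
}

# add_command UUID IP THUMBPRINT: print the command from step 4.
add_command() {
    printf 'esxcli vsan cluster unicastagent add -t node -u %s -U true -a %s -p 12321 -T %s\n' "$1" "$2" "$3"
}

# Possible usage on the incomplete host (IP addresses are placeholders):
#   esxcli vsan cluster unicastagent list \
#       | missing_peers 192.0.2.12 192.0.2.11 192.0.2.12 192.0.2.13 192.0.2.14
# On the missing host:
#   openssl x509 -in /etc/vmware/ssl/rui.crt -fingerprint -sha1 -noout | thumbprint
```

Printing the add command rather than executing it keeps the manual review from step 4 in place; the UUID still has to be collected on the missing host with cmmds-tool whoami as described in step 3.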

If assistance is required, please open a Ticket with VMware by Broadcom Support.

 

Additional Information