Rebuilding vSAN on hosts via CLI when vCenter is unavailable.


Article ID: 388144


Updated On:

Products

VMware vSAN

Issue/Introduction

A vSAN environment has lost its vCenter, often because vCenter was running on the same compromised vSAN datastore, and a new vCenter cannot be deployed quickly to manage the ESXi hosts in question.

Environment

vSAN 7.x

vSAN 8.x

Cause

The vSAN datastore is inaccessible, usually due to network issues involving elements of the hosts' networking, including but not limited to vSAN VMkernel interfaces, unicast tables, and/or cluster membership.

Resolution

  1. Check and, if needed, configure the vSAN VMKernel Interface
  2. Check and, if needed, create the vSAN Cluster
  3. Add nodes to the cluster
  4. Check and configure or populate unicast tables

Step 1 – Create the VMKernel interface
In order for vSAN to function, each host/node requires a VMKernel with the vSAN service enabled on it. This requires other dependencies such as a vSwitch and a Port Group.

The following steps start from scratch, in case the GUI is not accessible.

To create a Standard vSwitch on which to configure the vSAN kernel run:

esxcli network vswitch standard add -v <vswitch_name>

Example:

esxcli network vswitch standard add -v vSwitch1

Once 'vSwitch1' is created, add the physical uplinks to it. To help identify which uplinks to use, run the following command:

esxcli network nic list

This returns details on all the physical network cards on the host, for example:

Name PCI Driver Link Speed Duplex MAC Address MTU Description
vmnic0 0000:01:00.0 ntg3 Up 1000Mbps Full 44:**:**:**:**:98 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic1 0000:01:00.1 ntg3 Up 1000Mbps Full 44:**:**:**:**:99 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic2 0000:02:00.0 ntg3 Down 0Mbps Half 44:**:**:**:**:9a 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic3 0000:02:00.1 ntg3 Down 0Mbps Half 44:**:**:**:**:9b 1500 Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet
vmnic4 0000:82:00.0 ixgbe Up 10000Mbps Full a0:**:**:**:**:cc 1500 Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
vmnic5 0000:82:00.1 ixgbe Up 10000Mbps Full a0:**:**:**:**:ce 1500 Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
vmnic6 0000:04:00.0 ixgbe Up 10000Mbps Full a0:**:**:**:**:c4 1500 Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
vmnic7 0000:04:00.1 ixgbe Up 10000Mbps Full a0:**:**:**:**:c6 1500 Intel Corporation Ethernet Controller 10 Gigabit X540-AT2
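
On hosts with many NICs, picking candidates by eye is error-prone. The following is a minimal sketch that reduces the listing to uplinks whose link is up, together with their speed. It assumes the column layout of the sample above (link state in column 4, speed in column 5). Some ESXi builds may print additional columns, so check the header on your host and pass different column numbers if needed. The function name `list_up_nics` is illustrative, not part of any VMware tooling.

```shell
# Filter `esxcli network nic list` output (read from stdin) down to
# NICs whose link is Up, printing "<name> <speed>".
# $1 = column holding the link state (default 4, as in the sample above)
# $2 = column holding the speed      (default 5, as in the sample above)
list_up_nics() {
    link_col=${1:-4}
    speed_col=${2:-5}
    awk -v l="$link_col" -v s="$speed_col" \
        '$1 ~ /^vmnic[0-9]+$/ && $l == "Up" { print $1, $s }'
}
```

On a host this would be used as: esxcli network nic list | list_up_nics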

In the following example, vmnic5 is added to 'vSwitch1'.

To add a vmnic to the standard switch created in the last step, run the following command:

esxcli network vswitch standard uplink add -v <vSwitch_name> -u vmnic<X>

Example:

esxcli network vswitch standard uplink add -v vSwitch1 -u vmnic5

Next, configure a portgroup for vSAN. In this example, the portgroup name is “vSAN”:

esxcfg-vswitch -A <portgroup_name> <vSwitch_name>

Example:

esxcfg-vswitch -A vSAN vSwitch1

To create a VMkernel interface in the portgroup just created, run the following:

esxcfg-vmknic -a -i <vsan_kernel_IP> -n <subnet_mask> -p <portgroup_name>

Example:

esxcfg-vmknic -a -i 192.168.100.1 -n 255.255.255.0 -p vSAN

Validate the vmkernel Interface by running the following command:

esxcfg-vmknic -l

To enable the vSAN service on the new VMKernel interface, run the following command:

esxcli vsan network ip add -i vmk<X>

Example:

esxcli vsan network ip add -i vmk1

Repeat the above steps on each remaining host that will participate in the cluster (two more hosts in this three-node example).
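
Because Step 1 has to be repeated per host with different values, it can help to generate the command sequence for review before typing it. The sketch below only prints the commands from this step and runs nothing. The function name `print_vsan_network_cmds` and its argument order are illustrative assumptions. The VMkernel name passed as the last argument is also an assumption: confirm the actual name assigned to the new interface with esxcfg-vmknic -l before enabling vSAN on it.

```shell
# Print (do not run) the Step 1 command sequence for one host.
# Arguments: <vswitch> <vmnic> <portgroup> <vsan_ip> <netmask> <vmk>
print_vsan_network_cmds() {
    if [ "$#" -ne 6 ]; then
        echo "usage: print_vsan_network_cmds <vswitch> <vmnic> <portgroup> <vsan_ip> <netmask> <vmk>" >&2
        return 1
    fi
    cat <<EOF
esxcli network vswitch standard add -v $1
esxcli network vswitch standard uplink add -v $1 -u $2
esxcfg-vswitch -A $3 $1
esxcfg-vmknic -a -i $4 -n $5 -p $3
esxcfg-vmknic -l
esxcli vsan network ip add -i $6
EOF
}
```

For the example values used above: print_vsan_network_cmds vSwitch1 vmnic5 vSAN 192.168.100.1 255.255.255.0 vmk1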

Step 2 – Creating the cluster
Once all the VMKernel interfaces are configured on all the hosts, create a vSAN Cluster on any one host.

To do this we run the following command:

esxcli vsan cluster new

View and take note of the new vSAN Sub-Cluster UUID by running the following command:

esxcli vsan cluster get

Example with output:

[root@se-emea-vsan01:~] esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2016-11-21T15:17:57Z
 Local Node UUID: 582*****-****-****-****-a***********4
 Local Node Type: NORMAL
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 582*****-0fd8-****-7460-a***********c
 Sub-Cluster Backup UUID: 582*****-cbfc-****-f794-a***********4
 Sub-Cluster UUID: 52bca225-0520-fd68-46c4-5e7edca5dfbd
 Sub-Cluster Membership Entry Revision: 6
 Sub-Cluster Member Count: 1
 Sub-Cluster Member UUIDs: 582*****-cbfc-****-f794-a***********4
 Sub-Cluster Membership UUID: d2d*****-****-bbb9-9e1a-a***********c
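
The output contains several similarly named UUID fields, and only the Sub-Cluster UUID is the one needed for the join in the next step. A small sketch that extracts exactly that field, assuming the field label appears as in the sample output above (the function name `sub_cluster_uuid` is illustrative):

```shell
# Read `esxcli vsan cluster get` output on stdin and print only the
# value of the "Sub-Cluster UUID" field. The match is anchored at the
# start of the line, so "Sub-Cluster Master UUID", "Sub-Cluster Backup UUID"
# and "Sub-Cluster Membership UUID" are not picked up.
sub_cluster_uuid() {
    sed -n 's/^[[:space:]]*Sub-Cluster UUID:[[:space:]]*//p'
}
```

On a host this would be used as: esxcli vsan cluster get | sub_cluster_uuid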

Step 3 – Adding the other nodes to the cluster
On each of the remaining hosts, run the following command to add the host to the newly created cluster:

esxcli vsan cluster join -u <cluster_uuid>

Example (using the Sub-Cluster UUID from the sample output from Step 2):

esxcli vsan cluster join -u 52bca225-0520-fd68-46c4-5e7edca5dfbd

To verify that the nodes have joined the cluster, run esxcli vsan cluster get again. The Sub-Cluster Member Count should have increased to 3, and the Sub-Cluster Member UUIDs should list the other members:

[root@se-emea-vsan01:~] esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2016-11-21T15:17:57Z
 Local Node UUID: 582a29ea-cbfc-195e-f794-a0369f7894c4
 Local Node Type: NORMAL
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 582*****-0fd8-****-7460-a***********c
 Sub-Cluster Backup UUID: 582*****-cbfc-****-f794-a***********4
 Sub-Cluster UUID: 52bca225-0520-fd68-46c4-5e7edca5dfbd
 Sub-Cluster Membership Entry Revision: 6
 Sub-Cluster Member Count: 3
 Sub-Cluster Member UUIDs: 582a29ea-cbfc-195e-f794-a0369f7894c4, 582a2bf8-4e36-abbf-5318-a0369f7894d4, 582a2c3b-d104-b96d-d089-a0369f78946c
 Sub-Cluster Membership UUID: d2d*****-****-bbb9-9e1a-a***********c

Step 4 – Check and configure Unicast Tables

Follow these steps to manually rebuild the Unicast Agent List on each host:

 

1. Identify the VMkernel port used for vSAN, its IP address, and the node UUID on each host in the cluster.
 

1.1 SSH to every ESXi host in the cluster and login as root.

1.2 Run the following command to identify the VMkernel port used for vSAN, and copy the output for later use: 

esxcli vsan network list

[root@server name:~] esxcli vsan network list
Interface
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: ########-####-####-####-############
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan

Note: Take note of the VmkNic Name. In the output above it is "vmk1".

1.3 Find the IP address for "vmk1" with the following command:


esxcli network ip interface ipv4 get | grep vmk1
 

[root@server name:~] esxcli network ip interface ipv4 get | grep vmk1
vmk1  ###.##.##.##  ###.##.###.###  ###.##.###.###  STATIC        ###.##.###.###    false


Note: The IP address of vmk1 on this host is: ###.##.##.##
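
One caveat with the grep above: on a host that also has interfaces such as vmk10 or vmk11, grep vmk1 matches those lines too. The sketch below shows a stricter alternative that matches the interface name exactly, plus a helper that pulls the VmkNic Name out of the esxcli vsan network list output. Both function names are illustrative, and both assume the output layouts shown in the samples above.

```shell
# Read `esxcli vsan network list` output on stdin and print each
# VMkernel interface that carries vSAN traffic, one per line.
vsan_vmk_names() {
    sed -n 's/^[[:space:]]*VmkNic Name:[[:space:]]*//p'
}

# Read `esxcli network ip interface ipv4 get` output on stdin and print
# the IPv4 address (second column) of exactly the interface named in $1.
vmk_ipv4() {
    awk -v vmk="$1" '$1 == vmk { print $2 }'
}
```

On a host this would be used as: esxcli network ip interface ipv4 get | vmk_ipv4 vmk1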

 

1.4 Find the node UUID of the host with the following command:


cmmds-tool whoami

[root@server name:~] cmmds-tool whoami
########-####-####-####-############

 

1.5 Equipped with host UUID and vSAN VMkernel port IP address for ALL hosts in the cluster, start building the Unicast agent lists.
 

3-node cluster example:

server name | UUID: ########-####-####-####-############ | vSAN IP: ###.##.##.##
server name | UUID: ########-####-####-####-############ | vSAN IP: ###.##.##.##
server name | UUID: ########-####-####-####-############ | vSAN IP: ###.##.##.##



2. Building the Unicast Agent List
 

2.1 Before making changes to the Unicast agent lists, run the following command on all nodes in the cluster to temporarily ignore "Cluster Member List Updates" coming from vCenter.

esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates

[root@server name:~]  esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1

2.2 A host's Unicast agent list sometimes contains incorrect entries, such as:

    • "Supports Unicast" set to False for a particular host.
    • An incorrect IP address.
    • An incorrect Node UUID.
    • A host that is NOT a Stretched Cluster Witness host marked with "IsWitness" as True.

An existing entry cannot be modified. It must be deleted and re-added.

2.2.1 Check the current unicast agent list with the following command:

esxcli vsan cluster unicastagent list

[root@server name:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address    
------------------------------------  ---------  ----------------  ------------- 
########-####-####-####-############          0             false  ###.##.###.##
########-####-####-####-############          0             false  ###.##.###.##
 

Note: In the output above, the second entry has an incorrect IP address, and both entries have the "Supports Unicast" flag set as "false".

Note: For encrypted environments, take note of the thumbprint as it appears here. It should be the same for all nodes since it comes from the vCenter. It will need to be added to the new entries.
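
Two of the problems described above can be spotted mechanically: entries with "Supports Unicast" set to false, and an entry that carries the local host's own vSAN IP. The following sketch flags both. It assumes the column order shown in the listing above (NodeUuid, IsWitness, Supports Unicast, IP Address), and the function name `audit_unicast_list` is illustrative. It cannot tell whether another host's IP or UUID is wrong. That still requires comparing against the values collected in step 1.

```shell
# Read `esxcli vsan cluster unicastagent list` output on stdin and
# report suspicious rows. $1 is the vSAN IP of the host being checked.
audit_unicast_list() {
    awk -v self="$1" '
        # Only data rows start with hex digits followed by a dash;
        # this skips the header, the separator and any prompt line.
        $1 !~ /^[0-9a-fA-F]+-/ { next }
        $3 != "true" { print $1 ": Supports Unicast is " $3 }
        $4 == self   { print $1 ": entry uses the local host IP " $4 }
    '
}
```

On a host this would be used as: esxcli vsan cluster unicastagent list | audit_unicast_list <Host_VSAN_IP>. No output means neither of the two conditions was found.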
 

2.2.2 To fix this problem, run the following command to delete the errant entries. You can also use this to delete all the entries and rebuild the unicast agent list from scratch.
 

Syntax: esxcli vsan cluster unicastagent remove -a <Host_VSAN_IP>

[root@server name:~] esxcli vsan cluster unicastagent remove -a ###.##.###.##

Note: Run esxcli vsan cluster unicastagent list again to verify that the entry has been removed.

2.3 Add entries to the unicast agent list.

    • **REMINDER** When building the Unicast agent list on an ESXi host, add entries for all the other hosts but never add the IP of the host whose table is being configured.

    • If an ESXi host has its own IP address in its Unicast agent list, vSAN can become unstable, networking problems can arise, and the host may encounter a PSOD.

    • Example: Using the 3-node example, each host will have 2 entries.

Data Node entry syntax:

esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321

Example on host 1:

[root@server name1:~] esxcli vsan cluster unicastagent add -t node -u ########-####-####-####-############ -U true -a ###.##.###.## -p 12321

[root@server name1:~] esxcli vsan cluster unicastagent list
NodeUuid                              IsWitness  Supports Unicast  IP Address     Port   Iface Name  Cert Thumbprint  SubClusterUuid
------------------------------------  ---------  ----------------  -------------  -----  ----------  ---------------  ------------------------------------
########-####-####-####-############          0              true  ###.##.###.##  12321                               ########-####-####-####-############

Note: After running the commands, check the Unicast agent list to confirm entries were added correctly as shown in the output above.

Note: In a Stretched Cluster, you must set the "IsWitness" flag to "True" for the Witness host entry.

Note: For encrypted environments, you must add the vCenter's thumbprint with the -T switch as part of the command string.

Witness Node entry syntax: 

esxcli vsan cluster unicastagent add -t witness -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321

2.4 Repeat from step 2.1 on the remaining vSAN hosts, making sure to not include the IP of the host whose table is being configured.
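
Since the same inventory is used on every host and only the excluded entry changes, the add commands can be generated rather than typed by hand. The sketch below prints the commands for one host and runs nothing. The inventory format is an assumption made for this example: one line per host with the node UUID, the vSAN IP, and an optional third field "witness". The function name `print_unicast_add_cmds` is illustrative. It does not handle the -T thumbprint needed in encrypted environments, which would have to be appended manually.

```shell
# Read an inventory on stdin ("<uuid> <vsan_ip> [witness]" per line) and
# print the unicastagent add command for every host except the local one.
# $1 = node UUID of the host whose table is being built (cmmds-tool whoami).
print_unicast_add_cmds() {
    local_uuid=$1
    while read -r uuid ip role; do
        case $uuid in
            ''|'#'*) continue ;;                  # skip blank lines and comments
        esac
        [ "$uuid" = "$local_uuid" ] && continue   # never add the local host
        if [ "$role" = "witness" ]; then kind=witness; else kind=node; fi
        echo "esxcli vsan cluster unicastagent add -t $kind -u $uuid -U true -a $ip -p 12321"
    done
}
```

With the inventory saved in a file (hypothetical name hosts.txt), this would be used as: print_unicast_add_cmds "$(cmmds-tool whoami)" < hosts.txt. Review the printed commands before running them.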

3. Final steps

3.1 Assuming there are no issues with the physical network, the vSAN Cluster should form right away.
If the rebuild on each host was done correctly, the following command shows the correct number of Cluster members:

[root@server name:~] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2024-08-16T06:22:52Z
   Local Node UUID: ########-####-####-####-############
   Local Node Type: NORMAL
   Local Node State: BACKUP
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: ########-####-####-####-############
   Sub-Cluster Backup UUID: ########-####-####-####-############
   Sub-Cluster UUID: ########-####-####-####-############
   Sub-Cluster Membership Entry Revision: 37
   Sub-Cluster Member Count: 2  ←--------------
   Sub-Cluster Member UUIDs: ########-####-####-####-############, ########-####-####-####-############
   Sub-Cluster Member HostNames: server name1, server name2
   Sub-Cluster Membership UUID: ########-####-####-####-############
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: ########-####-####-####-############ 7 2024-08-16T06:07:52.230
   Mode: REGULAR
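
When checking several hosts in a row, comparing the member count against the expected value can be scripted. A minimal sketch, assuming the "Sub-Cluster Member Count" label appears as in the output above (the function name `check_member_count` is illustrative):

```shell
# Read `esxcli vsan cluster get` output on stdin and compare the
# Sub-Cluster Member Count with the expected number of hosts in $1.
# Returns 0 on a match, 1 otherwise.
check_member_count() {
    expected=$1
    actual=$(sed -n 's/^[[:space:]]*Sub-Cluster Member Count:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual members"
    else
        echo "MISMATCH: expected $expected members, found ${actual:-none}"
        return 1
    fi
}
```

On a host in a three-node cluster this would be used as: esxcli vsan cluster get | check_member_count 3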
 

3.2 After the process has completed successfully, re-enable "Cluster Member List Updates" on all the nodes in the cluster. Run the following command:
 

esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates

[root@server name:~]  esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0

 

For further details on vSAN Unicast issues, see:

Configuring vSAN Unicast networking from the command line