When shutting down a vSAN stretched cluster with witness traffic separation using the Cluster Shutdown Wizard, it fails with "Operation timed out"

Article ID: 318129

Products

VMware vSAN

Issue/Introduction

Symptoms:
  1. The cluster is a vSAN stretched cluster.
  2. Witness traffic separation is configured.
  3. The cluster is running ESXi 7.0 U3f through U3n, or ESXi 8.0 through 8.0c.
  4. One host remains powered on after all other hosts have been shut down correctly.
  5. Eventually, the automated cluster shutdown process in vCenter fails with the error message "Wait other hosts disconnected timeout <IP address of the orchestration host>".
Note: If the user manually shuts down the orchestration host before this error message appears, the shutdown process fails with "Operation timed out" after the orchestration host is powered off.
  6. Run the following commands on the orchestration host to display its network configuration:
esxcli vsan network list
Interface:
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: 52e11b48-d3f7-37c9-ee3e-eb5e59e045b5
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan

Interface:
   VmkNic Name: vmk0
   IP Protocol: IP
   Interface UUID: 52d743e8-6f04-85e8-70eb-cc22df14da5f
   Agent Group Multicast Address: 224.2.3.4
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 224.1.2.3
   Master Group IPv6 Multicast Address: ff19::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: witness

esxcfg-vmknic -l
Interface  Port Group/DVPort/Opaque Network        IP Family IP Address                              Netmask         Broadcast       MAC Address       MTU     TSO MSS   Enabled Type                NetStack            
vmk0       8                                       IPv4      192.xxx.x.181                           255.255.255.0   192.xxx.x.255   b4:7a:f1:82:25:4c 1500    65535     true    STATIC              defaultTcpipStack   
vmk1       16                                      IPv4      19.xx.x.181                             255.255.255.0   19.xx.x.255     00:50:56:60:20:03 1500    65535     true    STATIC              defaultTcpipStack   

or

esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask   IPv4 Broadcast  Address Type  Gateway      DHCP DNS
----  -------------  -------------  --------------  ------------  -----------  --------
vmk0  192.xxx.x.181  255.255.255.0  192.168.3.255   STATIC        192.168.3.1     false
vmk1  19.xx.x.181    255.255.255.0  19.16.3.255     STATIC        192.168.3.1     false
 
  7. In /var/run/log/vsanmgmt.log, the following pattern can be observed at the time of the automated shutdown process:
2022-10-07T08:36:53.719Z info vsand[2104476] [opID=089825e4-09a5 VsanRebootUtil::GetLocalHostName] Get localHostname 19.xx.x.181
2022-10-07T08:40:24.223Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['19.xx.x.182'], numLoop 1
2022-10-07T08:42:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 2
2022-10-07T08:44:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 3
2022-10-07T08:46:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 4

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

For context:
1. 19.xx.x.181 is the vSAN network IP of the orchestration host (vmk1, traffic type "vsan").
2. 192.xxx.x.181 is the witness traffic IP of the orchestration host (vmk0, traffic type "witness").
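
The mapping between traffic type and IP address can also be derived from the two command outputs rather than read off by eye. The following is a minimal Python sketch, assuming the output of `esxcli vsan network list` and `esxcli network ip interface ipv4 get` has been saved as text in the format shown in this article. The function names and the abbreviated sample strings are hypothetical and exist only for illustration.

```python
def parse_vsan_network_list(text):
    """Return {vmknic name: traffic type} from saved `esxcli vsan network list` output."""
    mapping = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "VmkNic Name":
            current = value
        elif key == "Traffic Type" and current is not None:
            mapping[current] = value
    return mapping


def parse_ipv4_table(text):
    """Return {vmknic name: IPv4 address} from saved `esxcli network ip interface ipv4 get` output."""
    addresses = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with the interface name; header and separator rows do not.
        if len(fields) >= 2 and fields[0].startswith("vmk"):
            addresses[fields[0]] = fields[1]
    return addresses


# Abbreviated sample input taken from the outputs shown in this article.
SAMPLE_VSAN = """\
Interface:
   VmkNic Name: vmk1
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Traffic Type: vsan

Interface:
   VmkNic Name: vmk0
   Agent Group IPv6 Multicast Address: ff19::2:3:4
   Traffic Type: witness
"""

SAMPLE_IPV4 = """\
Name  IPv4 Address   IPv4 Netmask   IPv4 Broadcast  Address Type  Gateway      DHCP DNS
----  -------------  -------------  --------------  ------------  -----------  --------
vmk0  192.xxx.x.181  255.255.255.0  192.168.3.255   STATIC        192.168.3.1     false
vmk1  19.xx.x.181    255.255.255.0  19.16.3.255     STATIC        192.168.3.1     false
"""

traffic = parse_vsan_network_list(SAMPLE_VSAN)
ips = parse_ipv4_table(SAMPLE_IPV4)
by_role = {traffic[nic]: ips[nic] for nic in traffic if nic in ips}
print(by_role)  # {'vsan': '19.xx.x.181', 'witness': '192.xxx.x.181'}
```

Both addresses belong to the same host, which is the key to reading the log excerpt above.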

The logs indicate that, after the other hosts were shut down, the IP address identifying the orchestration host in the cluster information switched from its vSAN IP to its witness traffic IP. The automated shutdown logic therefore treats the orchestration host as a different, still-connected host, and the cluster shutdown eventually fails with a timeout error.
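
To check a collected vsanmgmt.log for this pattern, the "Waiting other host power off" entries can be compared against the orchestration host's own IP addresses. The following is a rough Python sketch, assuming the log lines follow the format of the excerpt above. The function name `find_self_waits` and the regular expression are hypothetical helpers, not part of any VMware tooling.

```python
import re

# Matches the "Waiting other host power off" entries shown in the log excerpt.
WAIT_RE = re.compile(
    r"Waiting other host power off\. Connected host found "
    r"\[(?P<hosts>[^\]]*)\], numLoop (?P<loop>\d+)"
)


def find_self_waits(log_lines, own_ips):
    """Return (numLoop, ip) pairs where the orchestration host waits on one of its own IPs."""
    hits = []
    for line in log_lines:
        match = WAIT_RE.search(line)
        if not match:
            continue
        hosts = [h.strip().strip("'\"") for h in match.group("hosts").split(",") if h.strip()]
        for ip in hosts:
            if ip in own_ips:
                hits.append((int(match.group("loop")), ip))
    return hits


# Message portions of the log lines from this article (timestamps omitted).
SAMPLE_LOG = [
    "Waiting other host power off. Connected host found ['19.xx.x.182'], numLoop 1",
    "Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 2",
    "Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 3",
    "Waiting other host power off. Connected host found ['192.xxx.x.181'], numLoop 4",
]

# vSAN IP and witness traffic IP of the orchestration host.
OWN_IPS = {"19.xx.x.181", "192.xxx.x.181"}

self_waits = find_self_waits(SAMPLE_LOG, OWN_IPS)
print(self_waits)
```

Any non-empty result means the orchestration host is waiting for an address that is actually its own, which matches the symptom described in this article.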


Environment

VMware vSAN 8.0.x
VMware vSAN 7.0.x

Cause

In the automated shutdown logic, one host is chosen as the orchestration host. This host orchestrates the entire shutdown process across the cluster: it changes the necessary advanced settings, places the hosts into maintenance mode, and shuts them down in the required order.

As seen in the above logs, the IP reported for the orchestration host changes from its vSAN IP (19.xx.x.181) to its witness IP (192.xxx.x.181) after the first host in the cluster shuts down. Because the IP has changed, the entry is treated as a different host. The orchestration host therefore waits for the host with 192.xxx.x.181 to shut down before it shuts itself down. Because that address belongs to the orchestration host itself, this condition can never be met, and the process eventually times out.
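
The effect of identifying a host by a single IP can be illustrated with a toy model. The following Python sketch is purely illustrative: it is not VMware's code, and it does not describe how the fix is implemented. Both function names are hypothetical; only the IP addresses come from the logs above.

```python
def remaining_hosts(connected, local_name):
    """Hosts still to wait for when 'self' is recognised by one name only."""
    return [host for host in connected if host != local_name]


def remaining_hosts_alias_aware(connected, local_aliases):
    """Hosts still to wait for when every local IP counts as 'self'."""
    return [host for host in connected if host not in local_aliases]


# The orchestration host recorded its vSAN IP as its local name.
local_name = "19.xx.x.181"

# After the switch, the cluster information lists the same host by its witness IP.
connected_after_switch = ["192.xxx.x.181"]

# Single-name comparison: the host's own witness IP looks like another host.
still_waiting = remaining_hosts(connected_after_switch, local_name)
print(still_waiting)  # ['192.xxx.x.181']

# Comparison against all local IPs: nothing is left to wait for.
aliases = {"19.xx.x.181", "192.xxx.x.181"}
nothing_left = remaining_hosts_alias_aware(connected_after_switch, aliases)
print(nothing_left)  # []
```

In the first case the wait list never becomes empty, which corresponds to the repeated numLoop entries and the eventual timeout.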

Resolution

This issue is fixed in ESXi 7.0 U3o and ESXi 8.0 U1.
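
To check a list of hosts against the affected range, the release names can be compared programmatically. The following is a minimal Python sketch, assuming release names are written in the short form used in this article (for example "7.0 U3f", "8.0c", "8.0 U1"). The function `is_affected` and its parsing rules are hypothetical; they encode only the ranges stated above (7.0 U3f through U3n, and 8.0 through 8.0c).

```python
import re

# Short release names such as "7.0 U3f", "7.0U3n", "8.0", "8.0c", "8.0 U1".
RELEASE_RE = re.compile(r"^\s*(\d+\.\d+)\s*(?:U(\d+))?\s*([a-z])?\s*$")


def is_affected(release):
    """Return True if the release falls in the affected range listed in this article."""
    match = RELEASE_RE.match(release)
    if not match:
        raise ValueError(f"unrecognised release name: {release!r}")
    base = match.group(1)
    update = int(match.group(2) or 0)
    letter = match.group(3) or ""
    if base == "7.0":
        # Affected: 7.0 U3f through 7.0 U3n (fixed in 7.0 U3o).
        return update == 3 and "f" <= letter <= "n"
    if base == "8.0":
        # Affected: 8.0 through 8.0c (fixed in 8.0 U1).
        return update == 0 and letter <= "c"
    return False


for name in ["7.0 U3f", "7.0 U3n", "7.0 U3o", "8.0", "8.0c", "8.0 U1"]:
    print(name, is_affected(name))
```

Releases outside the two listed ranges return False, which means only that this article does not list them as affected.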

Workaround:
If vCenter Server is running outside of the cluster being shut down, click "Resume Shutdown" and the cluster shutdown task will complete.