Shutting down a vSAN stretched cluster with witness traffic separation using the Cluster Shutdown Wizard fails with "Operation timed out"

Article ID: 318129

Products

VMware vSAN

Issue/Introduction

Symptoms:
  1. The cluster is a vSAN stretched cluster.
  2. Witness traffic separation is configured.
  3. The cluster is running ESXi 7.0 U3f through U3n, or 8.0 through 8.0c.
  4. One host remains powered on after all other hosts have shut down correctly.
  5. Eventually, the automated cluster shutdown process in vCenter fails with the error message "Wait other hosts disconnected timeout <IP address of the orchestration host>".

Note: If the user manually shuts down the orchestration host before the aforementioned error message occurs, the shutdown process will fail with "Operation timed out" after the orchestration host is powered off.
Run the following commands on the orchestration host to display its network configuration:

# esxcli vsan network list
Interface:
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: 52e11b48-####-####-####-########5b5
   Agent Group Multicast Address: 2##.2.3.4
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 2##.1.2.3
   Master Group IPv6 Multicast Address: ####::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan
Interface:
   VmkNic Name: vmk0
   IP Protocol: IP
   Interface UUID: 52d743e8-####-####-####-########a5f
   Agent Group Multicast Address: 2##.2.3.4
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 2##.1.2.3
   Master Group IPv6 Multicast Address: ###9::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: witness

# esxcfg-vmknic -l
Interface  Port Group/DVPort/Opaque Network  IP Family  IP Address   Netmask        Broadcast      MAC Address        MTU   TSO MSS  Enabled  Type    NetStack
vmk0       8                                 IPv4       192.xxx.x.1  255.255.255.0  192.xxx.x.255  ##:##:f1:##:25:##  1500  65535    true     STATIC  defaultTcpipStack
vmk1       16                                IPv4       10.xxx.x.1   255.255.255.0  10.xx.x.255    ##:##:56:##:20:##  1500  65535    true     STATIC  defaultTcpipStack

or

# esxcli network ip interface ipv4 get
Name  IPv4 Address  IPv4 Netmask   IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  ------------  -------------  --------------  ------------  -------------  --------
vmk0  192.xxx.x.1   255.255.255.0  192.xxx.x.255   STATIC        192.xxx.x.254  false
vmk1  10.xxx.x.1    255.255.255.0  10.xxx.x.255    STATIC        192.xxx.x.254  false
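To confirm quickly which VMkernel interface carries which traffic type, the output of `esxcli vsan network list` can be reduced to a small mapping. The following is a minimal Python sketch, not a VMware tool. It assumes the output has been captured as text with one "Key: Value" field per line, as the command normally prints it. The function names `parse_vsan_network_list` and `uses_witness_traffic_separation` are hypothetical helpers introduced for illustration.

```python
def parse_vsan_network_list(text):
    """Map each VMkernel interface name to its vSAN traffic type.

    Assumes one "Key: Value" field per line (illustrative helper only).
    """
    traffic_by_vmk = {}
    current_vmk = None
    for raw_line in text.splitlines():
        # Split on the first colon only, so IPv6 values stay intact.
        key, sep, value = raw_line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "VmkNic Name":
            current_vmk = value
        elif key == "Traffic Type" and current_vmk is not None:
            traffic_by_vmk[current_vmk] = value
            current_vmk = None
    return traffic_by_vmk


def uses_witness_traffic_separation(text):
    """Return True if both a 'vsan' and a 'witness' interface are present."""
    traffic_types = set(parse_vsan_network_list(text).values())
    return {"vsan", "witness"} <= traffic_types


# Abbreviated sample based on the output shown in this article.
SAMPLE_OUTPUT = """\
Interface:
   VmkNic Name: vmk1
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Traffic Type: vsan
Interface:
   VmkNic Name: vmk0
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Traffic Type: witness
"""

print(parse_vsan_network_list(SAMPLE_OUTPUT))
print(uses_witness_traffic_separation(SAMPLE_OUTPUT))
```

If both a vsan and a witness interface are reported, the host matches the witness traffic separation condition listed under Symptoms.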
 
In /var/run/log/vsanmgmt.log, the following pattern can be observed at the time of the automated shutdown process:
2022-10-07T08:36:53.719Z info vsand[2104476] [opID=089825e4-09a5 VsanRebootUtil::GetLocalHostName] Get localHostname 10.xxx.x.1
2022-10-07T08:40:24.223Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['10.xx.x.182'], numLoop 1
2022-10-07T08:42:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 2
2022-10-07T08:44:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 3
2022-10-07T08:46:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 4

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

For context:
1. 10.xxx.x.1 is the orchestration host's IP address on the vSAN network (vmk1).
2. 192.xxx.x.1 is the orchestration host's IP address for witness traffic (vmk0).

The logs show that the IP address identifying the orchestration host in the cluster information switched from its vSAN IP to its witness IP after the other hosts were shut down. The automated shutdown logic therefore treats it as a different host, and the cluster shutdown eventually fails with a timeout error.
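The pattern described above can be checked mechanically. The following is a minimal Python sketch, assuming the log lines have the same shape as the excerpt in this article and that the orchestration host's vSAN and witness IP addresses are known from its network configuration. The function name `find_self_wait_loops` and the regular expression are illustrative assumptions, not part of any VMware tooling.

```python
import ast
import re

# Matches the "Waiting other host power off" lines shown in the excerpt.
WAIT_PATTERN = re.compile(
    r"^(?P<ts>\S+) .*PerformOrchestrationClusterPowerAction\] "
    r"Waiting other host power off\. "
    r"Connected host found (?P<hosts>\[.*?\]), numLoop (?P<loop>\d+)"
)


def find_self_wait_loops(log_lines, own_ips):
    """Return (timestamp, numLoop, hosts) for every wait iteration in which
    the only hosts still reported as connected are the orchestration host's
    own IP addresses."""
    own = set(own_ips)
    hits = []
    for line in log_lines:
        match = WAIT_PATTERN.search(line)
        if match is None:
            continue
        # The host list is logged in Python list syntax, e.g. ['192.xxx.x.1'].
        hosts = ast.literal_eval(match.group("hosts"))
        if hosts and set(hosts) <= own:
            hits.append((match.group("ts"), int(match.group("loop")), hosts))
    return hits


# Two lines taken from the excerpt above (masked addresses kept as-is).
SAMPLE_LINES = [
    "2022-10-07T08:40:24.223Z info vsand[2104476] [opID=089825e4-09a5 "
    "VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] "
    "Waiting other host power off. Connected host found ['10.xx.x.182'], numLoop 1",
    "2022-10-07T08:42:54.259Z info vsand[2104476] [opID=089825e4-09a5 "
    "VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] "
    "Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 2",
]

for ts, loop, hosts in find_self_wait_loops(SAMPLE_LINES, ["10.xxx.x.1", "192.xxx.x.1"]):
    print(f"{ts}: numLoop {loop} is waiting on the host's own address {hosts}")
```

Any hit means the orchestration host is waiting for one of its own addresses to disconnect, which matches the behavior described in this article.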


Environment

VMware vSAN 8.0.x
VMware vSAN 7.0.x

Cause

In the automated shutdown logic, a specific host is chosen as the orchestration host. This host orchestrates the entire shutdown process automatically across the cluster: it changes the necessary advanced settings, places the hosts into maintenance mode, and shuts them down accordingly.

As seen in the logs above, the IP address reported for the orchestration host changes from its vSAN IP (10.xxx.x.1) to its witness IP (192.xxx.x.1) after the first host in the cluster shuts down. Because the IP changes, the orchestration host is seen as a different host. The orchestration host therefore waits for the host with 192.xxx.x.1 to shut down before it proceeds to shut itself down. Since that address belongs to the orchestration host itself, this never happens, and the process eventually times out.

Resolution

This issue is fixed in VMware vSphere ESXi 8.0 U1 and ESXi 7.0 U3o.
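The affected range (ESXi 7.0 U3f through U3n, and 8.0 through 8.0c) and the fixed releases above can be expressed as a simple check. The following Python sketch is illustrative only. It assumes release labels written in the style used in this article, such as "7.0U3f", "8.0c", or "8.0 U1", and does not inspect build numbers. The function name `is_affected` is a hypothetical helper.

```python
import re

# Assumed label format: "<major.minor>[ U<update>][<patch letter>]".
_LABEL = re.compile(
    r"^\s*(\d+\.\d+)\s*(?:U(?:pdate)?\s*(\d+))?\s*([a-z])?\s*$",
    re.IGNORECASE,
)


def is_affected(label):
    """Return True if the release label falls in the affected range
    stated in this article (7.0 U3f through U3n, or 8.0 through 8.0c)."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"Unrecognized release label: {label!r}")
    release = match.group(1)
    update = int(match.group(2) or 0)
    patch = (match.group(3) or "").lower()
    if release == "7.0":
        return update == 3 and "f" <= patch <= "n"
    if release == "8.0":
        # The empty patch string (plain 8.0) sorts before "a".
        return update == 0 and patch <= "c"
    return False


for label in ("7.0U3f", "7.0 U3o", "8.0c", "8.0 U1"):
    print(label, "affected" if is_affected(label) else "not affected")
```

Releases outside the two ranges, including the fixed 7.0 U3o and 8.0 U1, are reported as not affected.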

Workaround:
If vCenter Server is running outside the cluster being shut down, click "Resume Shutdown" and the cluster shutdown task will complete.