Shutting down a vSAN stretched cluster with witness traffic separation using the Cluster Shutdown Wizard fails with "Operation timed out"
Article ID: 318129
Products
VMware vSAN
Issue/Introduction
Symptoms:
vSAN stretched clusters
Witness traffic separation
The cluster is running ESXi 7.0 U3f through 7.0 U3n, or ESXi 8.0 through 8.0c.
One host remains powered on while all other hosts have been correctly shut down
Eventually, the automated cluster shutdown process in vCenter fails with an error message "Wait other hosts disconnected timeout <IP address of the orchestration host>".
Note: If the user manually shuts down the orchestration host before the aforementioned error message occurs, the shutdown process will fail with "Operation timed out" after the orchestration host is powered off.
Run the following commands to get the network configuration of the orchestration host:
# esxcli vsan network list
Interface:
   VmkNic Name: vmk1
   IP Protocol: IP
   Interface UUID: 52e11b48-####-####-####-########5b5
   Agent Group Multicast Address: 2##.2.3.4
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 2##.1.2.3
   Master Group IPv6 Multicast Address: ####::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: vsan
Interface:
   VmkNic Name: vmk0
   IP Protocol: IP
   Interface UUID: 52d743e8-####-####-####-########a5f
   Agent Group Multicast Address: 2##.2.3.4
   Agent Group IPv6 Multicast Address: ###9::2:3:4
   Agent Group Multicast Port: 23451
   Master Group Multicast Address: 2##.1.2.3
   Master Group IPv6 Multicast Address: ###9::1:2:3
   Master Group Multicast Port: 12345
   Host Unicast Channel Bound Port: 12321
   Data-in-Transit Encryption Key Exchange Port: 0
   Multicast TTL: 5
   Traffic Type: witness
# esxcfg-vmknic -l
Interface  Port Group/DVPort/Opaque Network  IP Family  IP Address   Netmask        Broadcast      MAC Address        MTU   TSO MSS  Enabled  Type    NetStack
vmk0       8                                 IPv4       192.xxx.x.1  255.255.255.0  192.xxx.x.255  ##:##:f1:##:25:##  1500  65535    true     STATIC  defaultTcpipStack
vmk1       16                                IPv4       10.xxx.x.1   255.255.255.0  10.xx.x.255    ##:##:56:##:20:##  1500  65535    true     STATIC  defaultTcpipStack
or
# esxcli network ip interface ipv4 get
Name  IPv4 Address   IPv4 Netmask   IPv4 Broadcast  Address Type  Gateway        DHCP DNS
----  -------------  -------------  --------------  ------------  -------------  --------
vmk0  192.xxx.x.1    255.255.255.0  192.xxx.x.255   STATIC        192.xxx.x.254  false
vmk1  10.xxx.x.1     255.255.255.0  10.xxx.x.255    STATIC        192.xxx.x.254  false
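To confirm quickly that a host has witness traffic separation configured, the `esxcli vsan network list` output can be parsed for the `VmkNic Name` and `Traffic Type` fields. The following is a minimal Python sketch, not a supported tool. The function names and the abbreviated sample text are illustrative assumptions, and the sketch assumes each interface block reports a single traffic type, as in the output above.

```python
# Abbreviated sample in the format of the `esxcli vsan network list` output above.
SAMPLE_OUTPUT = """\
Interface:
   VmkNic Name: vmk1
   IP Protocol: IP
   Traffic Type: vsan
Interface:
   VmkNic Name: vmk0
   IP Protocol: IP
   Traffic Type: witness
"""


def traffic_type_by_vmknic(output):
    """Map each VMkernel NIC name to the traffic type reported for it."""
    traffic = {}
    current_vmk = None
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if line.startswith("VmkNic Name:"):
            current_vmk = line.split(":", 1)[1].strip()
        elif line.startswith("Traffic Type:") and current_vmk is not None:
            traffic[current_vmk] = line.split(":", 1)[1].strip()
            current_vmk = None
    return traffic


def has_witness_traffic_separation(traffic):
    """True when one vmknic carries 'witness' traffic and another carries 'vsan'."""
    witness_nics = {vmk for vmk, kind in traffic.items() if kind == "witness"}
    vsan_nics = {vmk for vmk, kind in traffic.items() if kind == "vsan"}
    return bool(witness_nics and vsan_nics)


if __name__ == "__main__":
    nics = traffic_type_by_vmknic(SAMPLE_OUTPUT)
    print(nics)
    print(has_witness_traffic_separation(nics))
```

For the sample, `vmk1` maps to `vsan` and `vmk0` maps to `witness`, which matches the configuration described in this article.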
In /var/run/log/vsanmgmt.log, the following pattern can be observed at the time of the automated shutdown process:
2022-10-07T08:36:53.719Z info vsand[2104476] [opID=089825e4-09a5 VsanRebootUtil::GetLocalHostName] Get localHostname 10.xxx.x.1
2022-10-07T08:40:24.223Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['10.xx.x.182'], numLoop 1
2022-10-07T08:42:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 2
2022-10-07T08:44:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 3
2022-10-07T08:46:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 4
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
For context:
1. 10.xxx.x.1 is the vSAN network IP of the orchestration host.
2. 192.xxx.x.1 is the witness traffic IP of the orchestration host.
The logs indicate that the IP used to identify the host in the cluster information switched from the vSAN IP to the witness IP after the other hosts were shut down. The automated shutdown logic therefore treats it as a different host, and the cluster shutdown eventually fails with a timeout error.
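The pattern in the log excerpt can be detected programmatically. The following is a minimal Python sketch, assuming the "Waiting other host power off" message format shown in the excerpt. The function names are illustrative and not part of any VMware tool, and the caller supplies the orchestration host's own vSAN and witness IPs (for example, from the network configuration commands above).

```python
import re

# Matches the "Waiting other host power off" entries from the excerpt above.
WAIT_PATTERN = re.compile(
    r"Waiting other host power off\. "
    r"Connected host found \[(?P<hosts>[^\]]*)\], numLoop (?P<loop>\d+)"
)

# Log lines copied from the excerpt (masked IPs kept exactly as shown).
SAMPLE_LOG = """\
2022-10-07T08:36:53.719Z info vsand[2104476] [opID=089825e4-09a5 VsanRebootUtil::GetLocalHostName] Get localHostname 10.xxx.x.1
2022-10-07T08:40:24.223Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['10.xx.x.182'], numLoop 1
2022-10-07T08:42:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 2
2022-10-07T08:44:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 3
2022-10-07T08:46:54.259Z info vsand[2104476] [opID=089825e4-09a5 VsanClusterPowerSystemImpl::PerformOrchestrationClusterPowerAction] Waiting other host power off. Connected host found ['192.xxx.x.1'], numLoop 4
"""


def connected_hosts_by_loop(log_text):
    """Return [(loop_number, [host, ...]), ...] in the order found in the log."""
    entries = []
    for match in WAIT_PATTERN.finditer(log_text):
        hosts = [
            item.strip().strip("'\"")
            for item in match.group("hosts").split(",")
            if item.strip()
        ]
        entries.append((int(match.group("loop")), hosts))
    return entries


def loops_waiting_on_self(log_text, own_ips):
    """Loop numbers in which every host still reported as connected is one of own_ips."""
    own = set(own_ips)
    return [
        loop
        for loop, hosts in connected_hosts_by_loop(log_text)
        if hosts and set(hosts) <= own
    ]


if __name__ == "__main__":
    # vSAN IP and witness IP of the orchestration host, as given in this article.
    print(loops_waiting_on_self(SAMPLE_LOG, ["10.xxx.x.1", "192.xxx.x.1"]))  # [2, 3, 4]
```

A non-empty result means the orchestration host kept waiting on an address that belongs to itself, which is the signature of this issue.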
Environment
VMware vSAN 8.0.x
VMware vSAN 7.0.x
Cause
In the automated shutdown logic, a specific host is chosen as the orchestration host. This host orchestrates the entire shutdown process across the cluster: it changes the necessary advanced settings, places the hosts into maintenance mode, and shuts them down in order.
As seen in the above logs, the IP identifying the orchestration host changes from its vSAN IP 10.xxx.x.1 to its witness IP 192.xxx.x.1 after the first host in the cluster shuts down. Because the IP changes, the orchestration host is seen as a different host. The orchestration host therefore waits for the host with IP 192.xxx.x.1 to shut down before it shuts itself down. Because that host is the orchestration host itself, this can never happen, and the process eventually times out.
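The effect described above can be illustrated with a toy model. This Python sketch is not the actual vSAN shutdown code. It assumes, purely for illustration, that the orchestration host records its own address once and then excludes only that exact address from the list of hosts it is still waiting for. The function name `hosts_still_awaited` is hypothetical.

```python
def hosts_still_awaited(connected_ips, own_identity):
    """Hosts the orchestrator believes it must still wait for: all except itself."""
    return sorted(ip for ip in connected_ips if ip != own_identity)


# Identity recorded at the start of the shutdown (vSAN IP, per the log excerpt).
own_identity = "10.xxx.x.1"

# While other hosts are still up, the orchestrator is listed under its vSAN IP
# and is correctly filtered out of the wait list.
print(hosts_still_awaited({"10.xxx.x.1", "10.xx.x.182"}, own_identity))  # ['10.xx.x.182']

# Later, the same host is listed under its witness IP, so it no longer matches
# the recorded identity and the wait list never becomes empty.
print(hosts_still_awaited({"192.xxx.x.1"}, own_identity))  # ['192.xxx.x.1']
```

In this model the wait list can only become empty if the orchestration host itself powers off, which is why the task ends in a timeout.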
Resolution
This issue is fixed in VMware vSphere ESXi 8.0 U1 and ESXi 7.0 U3o.
Workaround: If vCenter Server is running outside the cluster being shut down, click "Resume Shutdown" and the cluster shutdown task will complete.