The vSAN cluster fails to restart after a manual shutdown using the prepare script documented in Manually Shut Down and Restart the vSAN Cluster.
The vCenter Server resides on the vSAN datastore and is unavailable after the shutdown.
When executing the recover script to restart the vSAN cluster, the process times out at the “Checking network” phase.
[root@hhks000111:~] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
Begin to recover the cluster...
ERROR:root:Fail to get the value of the option 'https_tunnel' from conf file /etc/vmware/vsan/vsanperf.conf
Traceback (most recent call last):
  File "/lib64/python3.11/configparser.py", line 805, in get
  File "/lib64/python3.11/collections/__init__.py", line 1012, in __getitem__
  File "/lib64/python3.11/collections/__init__.py", line 1004, in __missing__
KeyError: 'https_tunnel'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/vmware/vsan/perfsvc/cliutils.py", line 90, in LoadUseHTTPSTunnelConfig
  File "/lib64/python3.11/configparser.py", line 844, in getboolean
  File "/lib64/python3.11/configparser.py", line 824, in _get_conv
  File "/lib64/python3.11/configparser.py", line 819, in _get
  File "/lib64/python3.11/configparser.py", line 808, in get
configparser.NoOptionError: No option 'https_tunnel' in section: 'VSANPERF'
Time among connected hosts are synchronized.
Scheduled vSAN cluster restore task.
Waiting for the scheduled task...(24s left)
Checking network status...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Timeout, please try again later
Multiple attempts to run the recover script fail with the same error, as described in the Symptoms section.
Verify the vSAN tagging status:
Before the cluster shutdown, vSAN tagging is present on the ESXi hosts.
After the manual cluster shutdown, vSAN tagging is removed.
VMware vSAN 7.x
VMware vSAN 8.x
When the "reboot_helper.py prepare" script runs, it backs up the vSAN network settings and then removes them. If the script is run a second time, it queries and backs up the vSAN network settings again, overwriting the previous backup. Because the first run already removed the vSAN network settings, the second backup is incorrect (it contains no vSAN vmknics). As a result, the hosts are skipped during the recovery process.
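The overwrite behavior described above can be illustrated with a small toy model. This is only a sketch of the logic as explained in this article: the function names and the dictionary layout are hypothetical and do not mirror the internals of reboot_helper.py.

```python
# Toy model only: names and data layout are hypothetical, not reboot_helper.py internals.

def prepare(host):
    """Back up the current vSAN vmknic tags, then remove them from the host."""
    # The backup is taken from the *current* state and overwrites any earlier backup.
    host["backup"] = dict(host["vsan_vmknics"])
    host["vsan_vmknics"] = {}


def recover(host):
    """Restore tags from the backup; return False if the host is skipped."""
    if not host["backup"]:
        return False  # nothing was backed up, so the host is skipped
    host["vsan_vmknics"] = dict(host["backup"])
    return True


def simulate(prepare_runs):
    """Run prepare the given number of times, then attempt a recover."""
    host = {"vsan_vmknics": {"vmk0": ["vsan"]}, "backup": {}}
    for _ in range(prepare_runs):
        prepare(host)
    restored = recover(host)
    return restored, host["vsan_vmknics"]


print(simulate(1))  # (True, {'vmk0': ['vsan']})
print(simulate(2))  # (False, {})
```

With a single prepare run the tag is restored. With two runs the second backup is taken after the tags were already removed, so recover has nothing to restore.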
In the /var/log/shell.log file of the ESXi host where the script was run, the entries show when and how many times the script was run. Here, the prepare script was run twice:
2025-03-01T06:53:07.286Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
2025-03-01T06:56:00.587Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
In the /var/run/log/hostd.log file, the second run of the prepare script logs an event indicating that vmk0 is not used for vSAN:
2025-03-01T06:57:01.178Z Wa(164) Hostd[2100892]: [Originator@6876 sub=Libs opID=esxcli-7a-5742 sid=528365ee user=root] VsanInfoImpl: Vmknic vmk0 is not used for vSAN, skip removal.
When the recover script is run to restart the cluster, the ESXi hosts are skipped because the backup is empty. In the /var/run/log/vsanmgmt.log file, Get vmk info returns no vmknics and the hosts are skipped:
2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {}
2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Skip host with no backuped vmkinfo
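The shell.log check described earlier in this section can be sketched as a short Python helper that lists every reboot_helper.py invocation with its timestamp. It assumes the entry layout matches the shell.log excerpt in this article; the function name and regular expression are illustrative only.

```python
import re

# Assumes entries shaped like the shell.log excerpt in this article:
# <timestamp> In(14) shell[<pid>]: [root]: python .../reboot_helper.py <action>
PATTERN = re.compile(r"^(\S+)\s.*reboot_helper\.py\s+(prepare|recover)\b")


def find_helper_runs(lines):
    """Return (timestamp, action) tuples for each reboot_helper.py invocation."""
    runs = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            runs.append((match.group(1), match.group(2)))
    return runs


sample = [
    "2025-03-01T06:53:07.286Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare",
    "2025-03-01T06:56:00.587Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare",
]

runs = find_helper_runs(sample)
prepare_times = [ts for ts, action in runs if action == "prepare"]
print(len(prepare_times), prepare_times)
```

On a host, the same function could be fed the lines of /var/log/shell.log instead of the sample list. More than one prepare entry before a recover entry points to the condition described in this article.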
In a working setup where the recover script succeeds, Get vmk info returns the following output:
2025-02-07T18:34:27.245Z In(14) vsand[2100875]: [opID=e5a6bf84 VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {'vmk0': ['vsan']}
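The difference between the failing and the working vsanmgmt.log entries can be checked programmatically. The following is a minimal sketch that assumes the "Get vmk info:" value is printed as a Python-style dictionary, as in the two excerpts above; the helper name is hypothetical.

```python
import ast
import re

# Captures the dictionary printed after "Get vmk info:" at the end of a log line.
VMK_INFO = re.compile(r"Get vmk info:\s*(\{.*\})\s*$")


def parse_vmk_info(line):
    """Return the vmk info dict from a vsanmgmt.log line, or None if absent."""
    match = VMK_INFO.search(line)
    if not match:
        return None
    # literal_eval only parses Python literals, so it is safe for log content.
    return ast.literal_eval(match.group(1))


failing = (
    "2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc "
    "VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {}"
)
working = (
    "2025-02-07T18:34:27.245Z In(14) vsand[2100875]: [opID=e5a6bf84 "
    "VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {'vmk0': ['vsan']}"
)

for line in (failing, working):
    info = parse_vmk_info(line)
    status = "backup present" if info else "empty backup, host will be skipped"
    print(info, "->", status)
```

An empty dictionary corresponds to the "Skip host with no backuped vmkinfo" case. A non-empty dictionary lists the vmknics that the recover script will re-tag.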
This is expected behavior if the prepare script is run multiple times during the manual cluster shutdown and restart process. The vSAN engineering team is aware of this issue and plans to implement a change to improve the manual process in the vSAN 9.0U1 release.
If you encounter this issue during a manual cluster restart, follow these steps:
esxcli network ip interface tag add -i <VMkNic Name> -t=<Traffic Type>
For example: esxcli network ip interface tag add -i vmk2 -t VSAN
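When several VMkernel adapters need to be re-tagged, the command strings can be generated rather than typed by hand. The sketch below only builds the command text in the form of the example above and does not execute anything. The helper name is hypothetical, and the list of adapters that carried vSAN traffic before the shutdown is an input you must supply from your own records.

```python
def build_tag_commands(vmknics, traffic_type="VSAN"):
    """Return one 'tag add' command string per VMkernel adapter name."""
    commands = []
    for name in vmknics:
        # Basic sanity check; VMkernel adapters are named vmk0, vmk1, ...
        if not name.startswith("vmk"):
            raise ValueError(f"unexpected VMkernel adapter name: {name!r}")
        commands.append(
            f"esxcli network ip interface tag add -i {name} -t {traffic_type}"
        )
    return commands


# Hypothetical input: vmk2 carried vSAN traffic before the shutdown.
for cmd in build_tag_commands(["vmk2"]):
    print(cmd)
```

Review each generated command before running it on the corresponding ESXi host.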