Unable to restart a vSAN Cluster after manual shutdown using reboot

Products

VMware vSAN

Issue/Introduction

Symptoms

The vSAN cluster fails to restart after a manual shutdown performed with the prepare script documented in Manually Shut Down and Restart the vSAN Cluster.
The vCenter Server resides on the vSAN datastore and is unavailable after the shutdown.
When the recover script is run to restart the vSAN cluster, it times out at the “Checking network status” phase:

[root@hhks000111:~] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
Begin to recover the cluster...
ERROR:root:Fail to get the value of the option 'https_tunnel' fromconf file /etc/vmware/vsan/vsanperf.conf
Traceback (most recent call last):
File "/lib64/python3.11/configparser.py", line 805, in get
File "/lib64/python3.11/collections/__init__.py", line 1012, in __getitem__
File "/lib64/python3.11/collections/__init__.py", line 1004, in __missing__
KeyError: 'https_tunnel'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/vmware/vsan/perfsvc/cliutils.py", line 90, in LoadUseHTTPSTunnelConfig
File "/lib64/python3.11/configparser.py", line 844, in getboolean
File "/lib64/python3.11/configparser.py", line 824, in _get_conv
File "/lib64/python3.11/configparser.py", line 819, in _get
File "/lib64/python3.11/configparser.py", line 808, in get
configparser.NoOptionError: No option 'https_tunnel' in section: 'VSANPERF'
Time among connected hosts are synchronized.
Scheduled vSAN cluster restore task.
Waiting for the scheduled task...(24s left)
Checking network status...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Timeout, please try again later
The vSAN datastore is inaccessible, preventing virtual machines from powering on.

Validation Steps:

Multiple attempts to run the recover script fail with the same error, as described in the Symptoms section.
Verify the vSAN tagging status:

Before the cluster shutdown, vSAN tagging is present on the ESXi hosts.

After the manual cluster shutdown, vSAN tagging is removed.
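The tagging state of a VMkernel interface can be inspected with `esxcli network ip interface tag get -i <VMkNic Name>`. The following is a minimal sketch for checking that output in Python. It assumes the command prints a line of the form `Tags: Management, VSAN`; the helper names `parse_tags` and `has_vsan_tag` are hypothetical and not part of any VMware tool.

```python
def parse_tags(output):
    """Return the tag names from a 'Tags: A, B' line, or [] if no such line exists."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        # Assumption: the tag list is reported on a line labelled 'Tags'.
        if sep and key.strip().lower() == "tags":
            return [t.strip() for t in value.split(",") if t.strip()]
    return []


def has_vsan_tag(output):
    """True if the parsed tag list contains the vSAN traffic tag."""
    return any(t.lower() == "vsan" for t in parse_tags(output))


# Assumed sample outputs, before and after the manual shutdown.
before_shutdown = "   Tags: Management, VSAN\n"
after_shutdown = "   Tags: Management\n"

print(has_vsan_tag(before_shutdown), has_vsan_tag(after_shutdown))
```

If the actual output format on a given build differs, only `parse_tags` needs adjusting.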

Environment

VMware vSAN 7.x

VMware vSAN 8.x

Cause

The first run of "reboot_helper.py prepare" backs up the vSAN network settings and then removes them from the hosts. If "reboot_helper.py prepare" is run a second time, it queries and backs up the vSAN network settings again, overwriting the previous backup. Because the first run already removed the vSAN network settings, the second backup no longer contains them. As a result of this incorrect backup, the hosts are skipped during the recovery process.
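The overwrite can be illustrated with a toy model. This is only a sketch of the sequence described above, not the actual logic of reboot_helper.py; the `prepare` function, the `host` and `backup` dictionaries, and the `vmkinfo` key are all hypothetical.

```python
def prepare(host, backup):
    """Toy model of a prepare run: back up the current vSAN tagging, then remove it."""
    # Each run replaces whatever an earlier run stored in the backup.
    backup["vmkinfo"] = {
        nic: list(tags) for nic, tags in host.items() if "vsan" in tags
    }
    # Remove the vSAN tag from every interface that carries it.
    for tags in host.values():
        if "vsan" in tags:
            tags.remove("vsan")


host = {"vmk0": ["vsan"]}  # hypothetical host state before shutdown
backup = {}

prepare(host, backup)
first = dict(backup["vmkinfo"])   # backup taken while vmk0 is still tagged

prepare(host, backup)
second = dict(backup["vmkinfo"])  # backup taken after the tag was removed

print(first)   # {'vmk0': ['vsan']}
print(second)  # {}
```

In this model the second backup is empty, so a recovery step that relies on it has nothing to restore.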

Cause Validation:

The /var/log/shell.log file on the ESXi host where the script was run shows when, and how many times, the script was run. In this example, the prepare script was run twice:
2025-03-01T06:53:07.286Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
2025-03-01T06:56:00.587Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
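As a convenience, the prepare runs can be counted with a few lines of Python. This is a minimal sketch that assumes shell.log entries look like the two lines above, with the timestamp as the first whitespace-separated field; the function name `prepare_runs` is hypothetical.

```python
import re

# Matches an invocation of the prepare step, regardless of the interpreter path.
PREPARE_RE = re.compile(r"reboot_helper\.py\s+prepare\b")


def prepare_runs(log_text):
    """Return the timestamps of log lines that ran 'reboot_helper.py prepare'."""
    runs = []
    for line in log_text.splitlines():
        if PREPARE_RE.search(line):
            runs.append(line.split()[0])  # first field is the timestamp
    return runs


sample = """
2025-03-01T06:53:07.286Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
2025-03-01T06:56:00.587Z In(14) shell[20664661]: [root]: python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare
"""

runs = prepare_runs(sample)
print(len(runs), runs)
```

More than one timestamp between the shutdown and the recovery attempt indicates that the backup may have been overwritten.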
The /var/run/log/hostd.log file shows that during the second run of the prepare script, an event was logged indicating that vmk0 is not used for vSAN:
2025-03-01T06:57:01.178Z Wa(164) Hostd[2100892]: [Originator@6876 sub=Libs opID=esxcli-7a-5742 sid=528365ee user=root] VsanInfoImpl: Vmknic vmk0 is not used for vSAN, skip removal.
When the recover script is run to restart the cluster, the ESXi hosts are skipped because the backup is empty. In the /var/run/log/vsanmgmt.log file, "Get vmk info" returns no vmknics and the hosts are skipped:

2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {}
2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Skip host with no backuped vmkinfo
In a working setup where the recover script succeeds, "Get vmk info" returns output like the following:

2025-02-07T18:34:27.245Z In(14) vsand[2100875]: [opID=e5a6bf84 VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {'vmk0': ['vsan']}
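The two cases can be told apart programmatically. The sketch below assumes that the text after "Get vmk info: " is always a Python-style dictionary literal, as in the lines above; the function name `vmk_info_entries` is hypothetical.

```python
import ast

MARKER = "Get vmk info: "


def vmk_info_entries(log_text):
    """Return the dictionary logged after each 'Get vmk info: ' marker."""
    entries = []
    for line in log_text.splitlines():
        _, sep, payload = line.partition(MARKER)
        if sep:
            # literal_eval parses the literal without executing any code.
            entries.append(ast.literal_eval(payload.strip()))
    return entries


sample = """
2025-03-02T03:22:40.005Z In(14) vsand[2100039]: [opID=b62333dc VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {}
2025-02-07T18:34:27.245Z In(14) vsand[2100875]: [opID=e5a6bf84 VsanRebootUtil::_RestoreHostForClusterRebootWithNAMM] Get vmk info: {'vmk0': ['vsan']}
"""

for info in vmk_info_entries(sample):
    print("empty backup, host is skipped" if not info else "backup present: %s" % info)
```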

Resolution

This is expected behavior if the prepare script is run multiple times during the manual cluster shutdown and restart process. The vSAN engineering team is aware of this issue and plans to implement a change to improve the manual process in the vSAN 9.0U1 release.

Workaround:

If you encounter this issue during a manual cluster restart, follow these steps:

Run the following command on all ESXi hosts to restore vSAN tagging:

esxcli network ip interface tag add -i <VMkNic Name> -t=<Traffic Type>

For example: esxcli network ip interface tag add -i vmk2 -t VSAN
Re-run the recover script.
Once the script completes successfully, continue with the steps documented in Manually Shut Down and Restart the vSAN Cluster.
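When many hosts are involved, the tagging command from the first workaround step can be generated per host for review before it is run. The following is a minimal sketch that only builds the command strings and does not execute anything. The host names and the host-to-vmknic mapping are hypothetical and must be replaced with the interfaces that carried vSAN traffic before the shutdown.

```python
import shlex


def tag_add_command(vmknic, tag="VSAN"):
    """Build the esxcli command that adds a traffic tag to one VMkernel interface."""
    args = ["esxcli", "network", "ip", "interface", "tag", "add", "-i", vmknic, "-t", tag]
    return " ".join(shlex.quote(a) for a in args)


# Hypothetical inventory: which vmknic carried vSAN traffic on each host.
vsan_vmknics = {
    "esxi-01.example.local": "vmk2",
    "esxi-02.example.local": "vmk2",
}

for host_name, vmknic in sorted(vsan_vmknics.items()):
    print("%s: %s" % (host_name, tag_add_command(vmknic)))
```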