To be executed prior to placing any vSAN hosts in Maintenance Mode and rebooting them.
Before starting:
- Please ensure that all VMs are compliant with their Storage Policy, so that data redundancy is healthy and fully implemented.
See here for how to check (select your installed build):
Check Compliance for a VM Storage Policy
- Please ensure that your vSAN is healthy and no hardware failures exist.
You can validate this by running a retest of the Skyline Health check:
See here for how to run it (select your installed build):
Check vSAN Health
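Optional: On builds that include the "esxcli vsan health" namespace, you can also get a quick per-host health overview from the ESXi shell (availability and output depend on the installed build):
esxcli vsan health cluster list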
For vSAN Hosts below 6.7 Update 3:
Note:
The following steps need to be executed on all hosts in the vSAN cluster prior to rebooting them simultaneously.
The hosts do not need to be in Maintenance Mode to complete these steps:
1. Make sure all the hosts in the cluster are NTP time synchronized.
For more information, see Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Web Client.
Note: It is very important that all the hosts are NTP time synchronized for this workaround to work.
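Optional: A quick way to spot-check this from the ESXi shell is to confirm that the NTP daemon is running and to compare the clock across hosts (output format may vary by build):
/etc/init.d/ntpd status
esxcli hardware clock get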
2. Download the scripts attached to this article (pre_reboot.sh & post_reboot.sh) to the vSAN host.
Place them on a persistent storage location other than the vSAN datastore, so that they are available after the reboot.
If no persistent storage other than the vSAN datastore is available, re-upload the scripts to the hosts after the reboot.
Assign the execute permission needed to run the scripts by using the commands below via SSH/PuTTY (remove the "$" prior to running):
$ chmod +x pre_reboot.sh
$ chmod +x post_reboot.sh
3. Take a backup of the vSAN network configuration by saving the output of the command "esxcli vsan network list".
Place the file on persistent storage other than the vSAN datastore.
Example of how to save the output to a .txt file:
esxcli vsan network list > vsan_network_list_backup.txt
Example of output on a regular Cluster:
In the following example "vmk0" is the VmkNic interface used for vSAN and the traffic type is "vsan".
Note: Depending on your configuration you may have more than one VmkNic interface configured for vSAN.
[root@sc-rdops-vm06-dhcp-174-97:~] esxcli vsan network list
Interface
VmkNic Name: vmk0
IP Protocol: IP
Interface UUID: ########-####-####-####-########a627
Agent Group Multicast Address: 10.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 10.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast TTL: 5
Traffic Type: vsan
Example of output on a Stretched Cluster:
In addition to the output above, you might see a dedicated VmkNic configured for communication with the Witness host. Here "vmk1" is the VmkNic interface used for Witness communication (traffic type: witness).
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: ########-####-####-####-########fe94
Agent Group Multicast Address: 10.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 10.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast TTL: 5
Traffic Type: witness
4. Based on Step 3:
Modify the pre_reboot.sh file to disable vSAN traffic on all of the VmkNics found in Step 3.
For each of the VmkNics, add the following command to the pre_reboot.sh file:
esxcli vsan network ip remove -i <VmkNic Name>
Example:
For the example configuration listed in Step 3, the following commands need to be added to the pre_reboot.sh script (a complete example script is sketched below):
esxcli vsan network ip remove -i vmk0
esxcli vsan network ip remove -i vmk1
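For illustration only (the content of the script attached to this article may differ), a pre_reboot.sh for the example configuration from Step 3 could look like this:
#!/bin/sh
# Disable vSAN traffic on every VmkNic found in Step 3 (vmk0/vmk1 are example names; adjust to your configuration).
esxcli vsan network ip remove -i vmk0
esxcli vsan network ip remove -i vmk1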
5. Based on Step 3:
Modify the post_reboot.sh file to re-enable vSAN traffic on all of the VmkNics found in Step 3.
For each of the VmkNics, add the following command to the post_reboot.sh file:
esxcli vsan network ip add -i <VmkNic Name> -T=<Traffic Type>
Example:
For the example configuration listed in Step 3, the following commands need to be added to the post_reboot.sh script (a complete example script is sketched below):
esxcli vsan network ip add -i vmk0 -T=vsan
esxcli vsan network ip add -i vmk1 -T=witness
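For illustration only (the content of the script attached to this article may differ), a post_reboot.sh for the example configuration from Step 3 could look like this:
#!/bin/sh
# Re-enable vSAN traffic on every VmkNic found in Step 3, using the traffic type noted there (vmk0/vmk1 are example names; adjust to your configuration).
esxcli vsan network ip add -i vmk0 -T=vsan
esxcli vsan network ip add -i vmk1 -T=witness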
6. Create a CRON job that runs the "pre_reboot.sh" script at the exact same time on all of the vSAN hosts of the related cluster.
Note: Select a time in the future at which all vSAN hosts will be available to run the script at the same time.
6.1) Create a backup of the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Copy it to another non-vSAN Location.
Example:
cp /var/spool/cron/crontabs/root /vmfs/volumes/datastore1/root_crontab.BKP
6.2) Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Add the CRON job at the end of this file:
Example:
Note: The format to be used in the Crontab file is: #min hour day mon dow command
If we want "pre_reboot.sh" to be run on all hosts simultaneously on Dec 15, 20:30 UTC, we need to add the following line on each vSAN Host:
30 20 15 12 * /vmfs/volumes/datastore/pre_reboot.sh
6.3) Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond
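Optional: If you prefer not to edit the Crontab file interactively, the same entry can be appended and crond restarted in one sequence. This is only a sketch and uses the example schedule and datastore path from above; adjust both to your environment:
echo "30 20 15 12 * /vmfs/volumes/datastore/pre_reboot.sh" >> /var/spool/cron/crontabs/root
kill -HUP $(cat /var/run/crond.pid)
/usr/lib/vmware/busybox/bin/busybox crond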
7. Repeat Steps 2-6 on every remaining vSAN host in the cluster.
8. After the CRON job has been run on all vSAN hosts (e.g. on Dec 15, 20:30 UTC), verify that the "Local Node State" of all vSAN hosts is "MASTER" via:
esxcli vsan cluster get | grep "Local Node State"
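With vSAN traffic disabled by pre_reboot.sh, each host is partitioned into its own single-node cluster, so the expected output on every host is:
Local Node State: MASTER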
9. Proceed with placing all vSAN hosts in Maintenance Mode with "No Action" by executing the following command on each host (remove the "$" prior to running):
$ esxcli system maintenanceMode set -e true -m noAction
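Optional: You can confirm the Maintenance Mode state of a host via the following command, which should return "Enabled" at this point:
esxcli system maintenanceMode get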
10. Proceed with rebooting all vSAN hosts.
Note:
The following steps need to be executed on all hosts in the vSAN cluster once all the vSAN hosts are back online after the reboot.
11. Verify on all vSAN hosts that the "Local Node State" is "MASTER" via:
esxcli vsan cluster get | grep "Local Node State"
12. Check and ensure that all the vSAN disks are showing up as mounted, i.e. with a CMMDS status of "true", on each vSAN host:
esxcli vsan storage list | grep "In CMMDS"
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
If you are seeing one or more disk entries with "false", the host might still be initializing the related disk(s) or might have encountered a disk issue.
You can check for disk issues by, for example, running the vSAN Health check.
See here for how to run it (select your installed build):
Check vSAN Health
13. On all vSAN hosts: Exit Maintenance Mode via the command below:
esxcli system maintenanceMode set -e false
14. Verify that all hosts are out of vSAN Maintenance Mode via:
esxcli vsan cluster get | grep "Maintenance Mode State"
Maintenance Mode State: OFF
15. On each vSAN host: Ensure that the changes made to "post_reboot.sh" prior to the reboot are not lost (see Step 5).
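For example, you can display the script and compare it against the commands added in Step 5 (the path below is only an example; use the location chosen in Step 2):
cat /vmfs/volumes/datastore1/post_reboot.sh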
16. Create a CRON job that runs the "post_reboot.sh" script to re-enable vSAN traffic on the VmkNic interfaces.
Note: Select a time in the future at which all vSAN hosts will be available to run the script at the same time.
16.1) Create a backup of the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Copy it to another non-vSAN Location.
Example:
cp /var/spool/cron/crontabs/root /vmfs/volumes/datastore1/root_crontab_Post_Reboot.BKP2
16.2) Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Remove the entry added in Step 6.2.
Add the new CRON job at the end of this file.
Example:
Note: The format to be used in the Crontab file is: #min hour day mon dow command
If we want "post_reboot.sh" to be run on all hosts simultaneously on Dec 15, 21:00 UTC, then we need to add the following line on each vSAN host:
00 21 15 12 * /vmfs/volumes/datastore/post_reboot.sh
16.3) Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond
17. After the new CRON job has been run on all vSAN hosts (e.g. on Dec 15, 21:00 UTC),
verify that all objects are healthy (= no inaccessible objects).
For that to be true the output of the following command needs to be empty:
cmmds-tool find -f python | grep -C5 CONFIG_STATUS | grep content | grep -v "state....7\|state....15"
18. Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Remove the entry added in Step 16.2.
19. Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond