To be executed prior to placing any vSAN hosts in Maintenance Mode and rebooting them.
Before starting:
- Please ensure that all VMs are compliant with their Storage Policy, so that data redundancy is healthy and fully implemented.
See here for how to check (select your installed build):
Check Compliance for a VM Storage Policy
- Please ensure that your vSAN is healthy and no hardware failures exist.
You can validate this by running a retest of the Skyline Health check:
See here for how to run it (select your installed build):
Check vSAN Health
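Optional: On builds that include the "esxcli vsan health" namespace, you can also get a quick per-host health overview from the ESXi shell (availability and output depend on the installed build):
esxcli vsan health cluster list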
For vSAN Hosts below 6.7 Update 3:
Note:
The following steps need to be executed on all hosts in the vSAN cluster prior to rebooting them simultaneously.
The hosts do not need to be in Maintenance Mode to complete these steps:
1. Make sure all the hosts in the cluster are NTP time synchronized.
For more information, see Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Web Client.
Note: It is very important that all the hosts are NTP time synchronized for this workaround to work.
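Optional: A quick way to spot-check this from the ESXi shell is to confirm that the NTP daemon is running and to compare the clock across hosts (output format may vary by build):
/etc/init.d/ntpd status
esxcli hardware clock get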
2. Download the scripts attached to this article (pre_reboot.sh & post_reboot.sh) to the vSAN host.
Place them on a persistent storage location other than the vSAN datastore, so that they are available after the reboot.
If no persistent storage other than the vSAN datastore is available, re-upload the scripts to the hosts after the reboot.
Assign the execute permission needed to run the scripts by using the commands below via SSH/PuTTY (remove the "$" prior to running):
$ chmod +x pre_reboot.sh
$ chmod +x post_reboot.sh
3. Take a backup of the vSAN network configuration by saving the output of the command "esxcli vsan network list".
Place the file on persistent storage other than the vSAN datastore.
Example of how to save the output to a .txt file:
esxcli vsan network list > vsan_network_list_backup.txt
Example of output on a regular Cluster:
In the following example "vmk0" is the VmkNic interface used for vSAN and the traffic type is "vsan".
Note: Depending on your configuration you may have more than one VmkNic interface configured for vSAN.
[root@sc-rdops-vm06-dhcp-174-97:~] esxcli vsan network list
Interface
VmkNic Name: vmk0
IP Protocol: IP
Interface UUID: ########-####-####-####-########a627
Agent Group Multicast Address: 10.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 10.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast TTL: 5
Traffic Type: vsan
Example of output on a Stretched Cluster:
In addition to the output above, you might see a dedicated VmkNic configured for communication with the Witness host. Here "vmk1" is the VmkNic interface used for Witness communication (traffic type: witness).
VmkNic Name: vmk1
IP Protocol: IP
Interface UUID: ########-####-####-####-########fe94
Agent Group Multicast Address: 10.2.3.4
Agent Group IPv6 Multicast Address: ff19::2:3:4
Agent Group Multicast Port: 23451
Master Group Multicast Address: 10.1.2.3
Master Group IPv6 Multicast Address: ff19::1:2:3
Master Group Multicast Port: 12345
Host Unicast Channel Bound Port: 12321
Multicast TTL: 5
Traffic Type: witness
4. Based on Step 3:
Modify the pre_reboot.sh file to disable vSAN traffic on all of the VmkNics found in Step 3.
For each of the VmkNics, add the following command to the pre_reboot.sh file:
esxcli vsan network ip remove -i <VmkNic Name>
Example:
For the example configuration listed in Step 3, the following commands need to be added to the pre_reboot.sh script (a complete example script is sketched below):
esxcli vsan network ip remove -i vmk0
esxcli vsan network ip remove -i vmk1
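For illustration only (the content of the script attached to this article may differ), a pre_reboot.sh for the example configuration from Step 3 could look like this:
#!/bin/sh
# Disable vSAN traffic on every VmkNic found in Step 3 (vmk0/vmk1 are example names; adjust to your configuration).
esxcli vsan network ip remove -i vmk0
esxcli vsan network ip remove -i vmk1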
5. Based on Step 3:
Modify the post_reboot.sh file to re-enable vSAN traffic on all of the VmkNics found in Step 3.
For each of the VmkNics, add the following command to the post_reboot.sh file:
esxcli vsan network ip add -i <VmkNic Name> -T=<Traffic Type>
Example:
For the example configuration listed in Step 3, the following commands need to be added to the post_reboot.sh script (a complete example script is sketched below):
esxcli vsan network ip add -i vmk0 -T=vsan
esxcli vsan network ip add -i vmk1 -T=witness
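For illustration only (the content of the script attached to this article may differ), a post_reboot.sh for the example configuration from Step 3 could look like this:
#!/bin/sh
# Re-enable vSAN traffic on every VmkNic found in Step 3, using the traffic type noted there (vmk0/vmk1 are example names; adjust to your configuration).
esxcli vsan network ip add -i vmk0 -T=vsan
esxcli vsan network ip add -i vmk1 -T=witness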
6. Create a CRON job that runs the "pre_reboot.sh" script at the exact same time on all of the vSAN hosts of the related cluster.
Note: Select a time in the future at which all vSAN hosts will be available to run the script at the same time.
6.1) Create a backup of the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Copy it to another non-vSAN Location.
Example:
cp /var/spool/cron/crontabs/root /vmfs/volumes/datastore1/root_crontab.BKP
6.2) Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Add the CRON job at the end of this file:
Example:
Note: The format to be used in the Crontab file is: #min hour day mon dow command
If we want "pre_reboot.sh" to be run on all hosts simultaneously on Dec 15, 20:30 UTC, we need to add the following line on each vSAN Host:
30 20 15 12 * /vmfs/volumes/datastore/pre_reboot.sh
6.3) Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond
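Optional: If you prefer not to edit the Crontab file interactively, the same entry can be appended and crond restarted in one sequence. This is only a sketch and uses the example schedule and datastore path from above; adjust both to your environment:
echo "30 20 15 12 * /vmfs/volumes/datastore/pre_reboot.sh" >> /var/spool/cron/crontabs/root
kill -HUP $(cat /var/run/crond.pid)
/usr/lib/vmware/busybox/bin/busybox crond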
7. Repeat Steps 2-6 on every remaining vSAN host in the cluster.
8. After the CRON job has been run on all vSAN hosts (e.g. on Dec 15, 20:30 UTC), verify that the "Local Node State" of all vSAN hosts is "MASTER" via:
esxcli vsan cluster get | grep "Local Node State"
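With vSAN traffic disabled by pre_reboot.sh, each host is partitioned into its own single-node cluster, so the expected output on every host is:
Local Node State: MASTER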
9. Proceed with placing all vSAN hosts in Maintenance Mode with "No Action" by executing the following command on each host (remove the "$" prior to running):
$ esxcli system maintenanceMode set -e true -m noAction
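Optional: You can confirm the Maintenance Mode state of a host via the following command, which should return "Enabled" at this point:
esxcli system maintenanceMode get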
10. Proceed with rebooting all vSAN hosts.
Note:
The following steps need to be executed on all hosts in the vSAN cluster once all the vSAN hosts are back online after the reboot.
11. Verify on all vSAN hosts that the "Local Node State" is "MASTER" via:
esxcli vsan cluster get | grep "Local Node State"
12. Check and ensure that all the vSAN disks are showing up as mounted, i.e. with a CMMDS status of "true", on each vSAN host:
esxcli vsan storage list | grep "In CMMDS"
In CMMDS: true
In CMMDS: true
In CMMDS: true
In CMMDS: true
If you are seeing one or more disk entries with "false", the host might still be initializing the related disk(s) or might have encountered a disk issue.
You can check for disk issues by, for example, running the vSAN Health check.
See here for how to run it (select your installed build):
Check vSAN Health
13. On all vSAN hosts: Exit Maintenance Mode via the command below:
esxcli system maintenanceMode set -e false
14. Verify that all hosts are out of vSAN Maintenance Mode via:
esxcli vsan cluster get | grep "Maintenance Mode State"
Maintenance Mode State: OFF
15. On each vSAN host: Ensure that the changes made to "post_reboot.sh" prior to the reboot are not lost (see Step 5).
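For example, you can display the script and compare it against the commands added in Step 5 (the path below is only an example; use the location chosen in Step 2):
cat /vmfs/volumes/datastore1/post_reboot.sh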
16. Create a CRON job that runs the "post_reboot.sh" script to re-enable vSAN traffic on the VmkNic interfaces.
Note: Select a time in the future at which all vSAN hosts will be available to run the script at the same time.
16.1) Create a backup of the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Copy it to another non-vSAN Location.
Example:
cp /var/spool/cron/crontabs/root /vmfs/volumes/datastore1/root_crontab_Post_Reboot.BKP2
16.2) Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Remove the entry added in Step 6.2.
Add the new CRON job at the end of this file.
Example:
Note: The format to be used in the Crontab file is: #min hour day mon dow command
If we want "post_reboot.sh" to be run on all hosts simultaneously on Dec 15, 21:00 UTC, then we need to add the following line on each vSAN host:
00 21 15 12 * /vmfs/volumes/datastore/post_reboot.sh
16.3) Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond
17. After the new CRON job has been run on all vSAN hosts (e.g. on Dec 15, 21:00 UTC),
verify that all objects are healthy (= no inaccessible objects).
For that to be true the output of the following command needs to be empty:
cmmds-tool find -f python | grep -C5 CONFIG_STATUS | grep content | grep -v "state....7\|state....15"
18. Edit the current Crontab file:
Crontab file: /var/spool/cron/crontabs/root
Open it via command: vi /var/spool/cron/crontabs/root
Remove the entry added in Step 16.2.
19. Stop and restart the currently running CROND daemon on the vSAN host by
running these two commands (remove the "$" prior to running):
$ kill -HUP $(cat /var/run/crond.pid)
$ /usr/lib/vmware/busybox/bin/busybox crond