On occasion, for maintenance or troubleshooting, it is necessary to temporarily take an ESXi host out of a vSAN-enabled cluster.
Because of the nature of vSAN as a shared datastore across all ESXi hosts, this can create uncertainty about what can and cannot be done, and how to proceed.
This article is intended to break down the different scenarios and options when an ESXi host needs to be removed from a vSAN cluster temporarily.
This applies to any vSphere cluster with a vSAN datastore enabled.
Need to temporarily remove an ESXi host from a vSAN cluster to perform a task or tasks. Possible reasons include:
- Perform some maintenance.
- Reboot the ESXi host.
- Perform an upgrade - hardware or software.
- Troubleshooting.
Before rebooting an ESXi host, or making any significant change, it should be put into Maintenance Mode.
Note: The details below deal specifically with vSAN, and don't take into account VMs running on the ESXi host, which will need to be migrated away from the ESXi host to get it into Maintenance Mode.
vSAN has three different Maintenance Mode options:
- Full Data Evacuation: This copies all data from the vSAN disks on the ESXi host to the rest of the vSAN datastore.
  - Safest option, but takes the most time.
- Ensure Accessibility: This copies data for any vSAN objects that would become inaccessible without the data on this ESXi host, but does not recreate full redundancy.
  - All vSAN objects will be accessible, but some will have reduced availability until either the ESXi host comes back out of Maintenance Mode, or the data is rebuilt.
  - If the cluster has a storage policy with redundancy (see Additional Information) and all vSAN objects are healthy, the ESXi host will go into Maintenance Mode almost instantly.
  - While redundancy is reduced, a second failure can make some data inaccessible; for example, if another ESXi host has a problem or a physical disk fails while this host is in Maintenance Mode.
- No Data Migration: Any vSAN objects on this ESXi host without redundancy will immediately become inaccessible.
  - ESXi host will go into Maintenance Mode instantly.
  - Generally not recommended.
Once an ESXi host is in Maintenance Mode, whatever maintenance work needs to be performed on it can be done safely with no impact to vSAN.
If the ESXi host is only intended to be out of the vSAN cluster for a short time, Ensure Accessibility is generally the preferred option, but the customer will need to make the decision for their own environment.
vSAN has a rebuild timer (default: 60 minutes) that recreates the data for a vSAN object once it has been in reduced availability for that length of time. This means that if the ESXi host is still in Maintenance Mode after 60 minutes, the data will be rebuilt on the other ESXi hosts. If this needs to be prevented, the rebuild time-out can be extended, but customers need to be aware that the risk of a second failure leading to data inaccessibility increases the longer a vSAN object is in reduced availability.
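The timer arithmetic can be sketched in Python as a planning aid. This is a simplified model, not vSAN's own logic: the function and constant names are hypothetical, and it assumes the timer starts the moment the host enters Maintenance Mode.

```python
from datetime import datetime, timedelta

# Default rebuild timer described above; the constant name is illustrative.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def rebuild_starts_at(entered_mm: datetime,
                      repair_delay: timedelta = DEFAULT_REPAIR_DELAY) -> datetime:
    """Earliest time at which a rebuild would be expected to begin."""
    return entered_mm + repair_delay


def rebuild_expected(entered_mm: datetime,
                     exited_mm: datetime,
                     repair_delay: timedelta = DEFAULT_REPAIR_DELAY) -> bool:
    """True if the host stayed in Maintenance Mode until the timer expired."""
    return exited_mm - entered_mm >= repair_delay
```

For a host that enters Maintenance Mode at 10:00 with the default timer, `rebuild_starts_at` returns 11:00. Passing a longer `repair_delay` models an extended time-out, with the trade-off described above.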
vSAN is a shared storage design, so the disks on each ESXi host are all included in one vSAN datastore. For data integrity, vSAN objects are made up of components, which are placed on physical disks on different ESXi hosts. The number of vSAN components, and their placement, for a vSAN object depends on the Storage Policy.
vSAN Storage Policies have several settings. The key one for this scenario is Failures To Tolerate (FTT).
- A vSAN object with FTT=0 does not have any redundancy; that is, it cannot tolerate even a single failure.
- A vSAN object with FTT=1 has redundancy; there are two copies of the data (components) on the vSAN datastore on different ESXi hosts so, if one ESXi host goes down, the vSAN object will remain accessible.
- Other options include RAID-5, which can tolerate one ESXi host failure, and RAID-6 and FTT=2, which can each tolerate two ESXi host failures while keeping the vSAN object accessible.
Much more information about vSAN Storage Policies can be found at docs.vmware.com.
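As a rough illustration of the policy list above, the following Python sketch records how many ESXi host failures each policy tolerates and checks whether an object stays accessible. The dictionary keys and the function name are illustrative labels, not vSAN API identifiers, and `hosts_unavailable` is assumed to count only hosts that hold a component of the object in question.

```python
# Host failures each policy can tolerate, taken from the list above.
# Keys are illustrative labels, not vSAN API identifiers.
FAILURES_TOLERATED = {
    "FTT=0": 0,
    "FTT=1": 1,
    "RAID-5": 1,
    "RAID-6": 2,
    "FTT=2": 2,
}


def stays_accessible(policy: str, hosts_unavailable: int) -> bool:
    """True if the object survives this many unavailable component hosts."""
    return hosts_unavailable <= FAILURES_TOLERATED[policy]
```

With one host in Maintenance Mode, `stays_accessible("FTT=1", 1)` is true. If a second host holding a component then fails, `stays_accessible("FTT=1", 2)` is false, which is the second-failure risk discussed earlier.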
Example 1: What happens to a vSAN object with a storage policy of FTT=1 when an ESXi host is put into Maintenance Mode with each of the three options above?
- Full Data Evacuation: The vSAN component on this ESXi host for the vSAN object will be copied to another ESXi host. The storage policy of FTT=1 remains in place and is honored.
- Ensure Accessibility: If the vSAN object is fully healthy, the vSAN component on this ESXi host for the vSAN object will not be copied to another ESXi host. The vSAN object remains accessible, but the storage policy of FTT=1 is not honored. If the vSAN object is not fully healthy, the component from this ESXi host will be rebuilt on another ESXi host, ensuring the vSAN object remains accessible.
- No Data Migration: Same as Ensure Accessibility if the object is healthy. If the object is not healthy, the component will still not be copied, so the object will become inaccessible.
Example 2: What happens to a vSAN object with a storage policy of FTT=0 when an ESXi host is put into Maintenance Mode with each of the three options above?
- Full Data Evacuation: The vSAN component on this ESXi host for the vSAN object will be copied to another ESXi host. The storage policy of FTT=0 remains in place and is honored.
- Ensure Accessibility: The vSAN component on this ESXi host for the vSAN object will be copied to another ESXi host. The storage policy of FTT=0 remains in place and is honored.
- No Data Migration: No data is copied, and the vSAN object will become inaccessible.
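The outcomes in Examples 1 and 2 can be condensed into a small Python model. This is only a sketch of the behavior described in this article: the `Mode`, `Outcome`, and `outcome` names are hypothetical, it covers a single object with a component on the host entering Maintenance Mode, and it reports only whether data is copied and whether the object stays accessible.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL_DATA_EVACUATION = "Full Data Evacuation"
    ENSURE_ACCESSIBILITY = "Ensure Accessibility"
    NO_DATA_MIGRATION = "No Data Migration"


@dataclass(frozen=True)
class Outcome:
    data_copied: bool   # component is copied or rebuilt on another host
    accessible: bool    # object stays accessible during Maintenance Mode


def outcome(mode: Mode, ftt: int, healthy: bool = True) -> Outcome:
    """Model the examples above for one object with a component on the host."""
    # Another usable copy exists only with redundancy and a healthy object.
    other_copy_available = ftt >= 1 and healthy
    if mode is Mode.FULL_DATA_EVACUATION:
        return Outcome(data_copied=True, accessible=True)
    if mode is Mode.ENSURE_ACCESSIBILITY:
        # Copy only when the object would otherwise become inaccessible.
        return Outcome(data_copied=not other_copy_available, accessible=True)
    # No Data Migration: nothing is copied, so access depends on another copy.
    return Outcome(data_copied=False, accessible=other_copy_available)
```

For instance, `outcome(Mode.NO_DATA_MIGRATION, ftt=0)` reports an inaccessible object with no data copied, matching Example 2.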
While certain environments may have specific requirements, the Ensure Accessibility option is generally suitable for tasks where the ESXi host will stop contributing to the vSAN cluster for only a short time.
- For example, reinstalling NSX on the ESXi host, or installing or removing a software VIB on the ESXi host.