Understanding vSAN Stretched Cluster Failure Scenarios

Article ID: 394978

Products

VMware vSAN

Issue/Introduction

When operating a vSAN stretched cluster environment, administrators need to understand how different failure scenarios impact virtual machine availability and data accessibility. This article provides detailed failure scenario tables showing the expected behavior for various failure types including host failures, site failures, witness failures, partition failures, and inter-site link (ISL) failures.

Administrators may observe:

  • Virtual machines becoming inaccessible during certain failure conditions
  • Different behaviors based on the Site Disaster Tolerance policy configuration
  • Varying impacts depending on whether Secondary Failures to Tolerate (FTT) is configured
  • Questions about data availability when multiple failures occur

Environment

VMware vSAN 7.x, 8.x, 9.x

Cause

The behavior during failure scenarios in vSAN stretched clusters is determined by the interaction between the Site Disaster Tolerance policy setting and the Secondary Failures to Tolerate (FTT) configuration. When failures occur, vSAN uses a voting mechanism to determine object availability based on component distribution across sites and the witness host. The specific combination of policy settings directly influences whether objects remain accessible and if virtual machines can continue running or need to be restarted during various failure conditions.
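The voting mechanism described above can be illustrated with a minimal sketch. This is a hypothetical Python model for reasoning about the tables that follow, not a VMware API: an object remains accessible only while a strict majority of its component votes is reachable from the surviving partition.

```python
# Illustrative model (not VMware code) of vSAN stretched-cluster quorum.
# An object's components carry votes in Site A, Site B, and on the witness;
# the object stays accessible only while a strict majority of its total
# votes remains reachable.

def object_accessible(votes, reachable):
    """votes: dict mapping 'site_a'/'site_b'/'witness' to vote counts.
    reachable: set of locations reachable from the surviving partition."""
    total = sum(votes.values())
    alive = sum(v for loc, v in votes.items() if loc in reachable)
    return alive * 2 > total  # strict majority required for quorum

# Site-mirrored object: one vote per data site, one on the witness.
mirrored = {"site_a": 1, "site_b": 1, "witness": 1}

# Full failure of Site A: Site B plus the witness hold 2 of 3 votes.
print(object_accessible(mirrored, {"site_b", "witness"}))  # True

# Simultaneous failure of Site A and the witness: quorum is lost.
print(object_accessible(mirrored, {"site_b"}))             # False
```

This simple majority rule is what the scenario tables below repeatedly apply: any combination of failures that leaves fewer than half the votes reachable renders the object inaccessible.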

Resolution

Understanding the expected behavior for each failure scenario helps in planning disaster recovery strategies and setting appropriate storage policies. The following tables detail the behavior for each failure type based on policy configuration.

Host Failure Scenarios

| Site Disaster Tolerance | Secondary FTT | VM Location | Failure | vSAN Behavior | VM Behavior |
| --- | --- | --- | --- | --- | --- |
| None - Preferred | No data redundancy | Site A or B | Host failure in Site A | Objects are inaccessible if the failed host contains one or more components of an object | VM cannot be restarted as the object is inaccessible |
| None - Preferred | RAID-1/5/6 | Site A or B | Host failure in Site A | Objects are accessible as there is site-local resilience | VM does not need to be restarted unless the VM was running on the failed host |
| Site Mirroring | No data redundancy | Site A or B | Host failure in Site A or B | Components on the failed host are inaccessible; read and write I/O goes across the ISL without local redundancy, and the rebuild occurs across the ISL | VM does not need to be restarted unless the VM was running on the failed host |
| Site Mirroring | RAID-1/5/6 | Site A or B | Host failure in Site A or B | Components on the failed host are inaccessible; read I/O is served locally due to RAID, and the rebuild occurs locally | VM does not need to be restarted unless the VM was running on the failed host |

Partition Failure Scenarios

| Site Disaster Tolerance | Secondary FTT | VM Location | Failure | vSAN Behavior | VM Behavior |
| --- | --- | --- | --- | --- | --- |
| None - Preferred | No data redundancy | Site B | Partition of Site B | Objects are accessible in Site B | VM resides in Site B and does not need to be restarted |
| Site Mirroring | No data redundancy | Site A | Partition of Site A | Objects are inaccessible in Site A as the full site is partitioned and quorum is lost | VM restarted in Site B |
| Site Mirroring | No data redundancy | Site B | Partition of Site A | Objects are inaccessible in Site A as the full site is partitioned and quorum is lost | VM does not need to be restarted as it resides in Site B |

Site Failure Scenarios

| Site Disaster Tolerance | Secondary FTT | VM Location | Failure | vSAN Behavior | VM Behavior |
| --- | --- | --- | --- | --- | --- |
| None - Preferred | No data redundancy | Site A | Full failure of Site A | Objects are inaccessible as the full site failed | VM cannot be restarted in Site B, as all objects reside in Site A |
| None - Preferred | No data redundancy | Site B | Full failure of Site B | Objects are accessible, as only Site A contains objects | VM can be restarted in Site A, as that is where all objects reside |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A | Objects are inaccessible in Site A as the full site failed | VM restarted in Site B |
| Site Mirroring | No data redundancy | Site B | Full failure of Site A | Objects are inaccessible in Site A as the full site failed | VM does not need to be restarted as it resides in Site B |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A and simultaneous host failure in Site B | Objects are inaccessible in Site A. If components reside on the failed host, the object is also inaccessible in Site B | VM cannot be restarted |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A and simultaneous host failure in Site B | Objects are inaccessible in Site A. If components do not reside on the failed host, the object is accessible in Site B | VM restarted in Site B |
| Site Mirroring | RAID-1/5/6 | Site A | Full failure of Site A and simultaneous host failure in Site B | Objects are inaccessible in Site A, but accessible in Site B as there is site-local resilience | VM restarted in Site B |

Witness Failure Scenarios

| Site Disaster Tolerance | Secondary FTT | VM Location | Failure | vSAN Behavior | VM Behavior |
| --- | --- | --- | --- | --- | --- |
| None - Preferred | No data redundancy | Site A | Witness host failure | No impact; the witness host is not used as data is not replicated | No impact |
| None - Non-Preferred | No data redundancy | Site B | Witness host failure | No impact; the witness host is not used as data is not replicated | No impact |
| Site Mirroring | No data redundancy | Site A | Witness host failure | Witness object inaccessible; VM remains accessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy | Site B | Witness host failure | Witness object inaccessible; VM remains accessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A and simultaneous witness host failure | Objects are inaccessible in Site A and Site B due to quorum being lost | VM cannot be restarted |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A followed by witness host failure a few minutes later | Pre vSAN 7.0 U3: Objects are inaccessible in Site A and Site B due to quorum being lost | VM cannot be restarted |
| Site Mirroring | No data redundancy | Site A | Full failure of Site A followed by witness host failure a few minutes later | vSAN 7.0 U3 and later: Objects are inaccessible in Site A, but accessible in Site B as votes have been recounted | VM restarted in Site B |
| Site Mirroring | No data redundancy | Site B | Full failure of Site B followed by witness host failure a few minutes later | vSAN 7.0 U3 and later: Objects are inaccessible in Site B, but accessible in Site A as votes have been recounted | VM restarted in Site A |

Inter-Site Link (ISL) Failure Scenarios

| Site Disaster Tolerance | Secondary FTT | VM Location | Failure | vSAN Behavior | VM Behavior |
| --- | --- | --- | --- | --- | --- |
| Site Mirroring | No data redundancy | Site A | Network failure between Site A and Site B (ISL down) | Site A binds with the witness, and objects in Site B become inaccessible | VM does not need to be restarted |
| Site Mirroring | No data redundancy | Site B | Network failure between Site A and Site B (ISL down) | Site A binds with the witness, and objects in Site B become inaccessible | VM restarted in Site A |
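The tie-break in the table above can be sketched as follows. This is an illustrative Python model, not VMware code, and it assumes Site A is the preferred site: with the ISL down, the witness binds with the preferred site, so that partition holds the vote majority and keeps its objects accessible.

```python
# Hypothetical model of the ISL-failure tie-break: with the inter-site link
# down, each data site forms its own partition, and the witness joins the
# preferred site's partition, giving it the vote majority.

def surviving_partition(preferred="site_a"):
    """Return the name of the partition that keeps quorum when the ISL fails."""
    partitions = {
        "site_a": {"site_a"},
        "site_b": {"site_b"},
    }
    partitions[preferred].add("witness")  # witness binds with the preferred site
    votes = {"site_a": 1, "site_b": 1, "witness": 1}
    # The partition holding a strict majority of votes keeps objects accessible.
    for name, members in partitions.items():
        if sum(votes[m] for m in members) * 2 > sum(votes.values()):
            return name
    return None

print(surviving_partition())  # site_a: VMs running in Site B are restarted in Site A
```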

Adaptive Quorum Control

vSAN 7.0 U3 introduced Adaptive Quorum Control (AQC) to improve data availability during specific failure conditions. This feature maintains the availability of objects during a site failure (or site maintenance) followed by subsequent unavailability of the witness host.

In a fully operational stretched cluster, quorum is determined through a voting mechanism that accounts for object components in both sites and the witness host appliance. When a data site experiences a planned or unplanned outage, vSAN adjusts the votes to favor the active site that still has quorum. This adjustment allows sufficient votes to maintain quorum and keeps data available during a planned or unplanned outage of the witness host appliance.

The vote adjustment process may take a few seconds to a few minutes depending on cluster size. As each object completes adjustment, that object can tolerate witness host failure while maintaining availability. This capability does not protect against simultaneous failure of a data site and witness.
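The vote recount can be sketched as follows. This is an assumed illustrative model, not VMware code, and the specific vote counts are hypothetical: after a data-site outage, vSAN re-weights votes toward the surviving site so that a later witness failure no longer breaks quorum.

```python
# Illustrative model (not VMware code) of Adaptive Quorum Control (AQC).
# Quorum requires a strict majority of all votes to be reachable.

def has_quorum(votes, reachable):
    alive = sum(v for loc, v in votes.items() if loc in reachable)
    return alive * 2 > sum(votes.values())

votes = {"site_a": 1, "site_b": 1, "witness": 1}

# Site A fails. Before any recount, losing the witness too drops quorum:
# Site B alone holds 1 of 3 votes.
print(has_quorum(votes, {"site_b"}))            # False

# AQC recount after the Site A outage: the surviving site's components
# absorb extra votes (counts here are illustrative), so a subsequent
# witness failure leaves Site B with a majority on its own.
votes_after_aqc = {"site_a": 1, "site_b": 3, "witness": 1}
print(has_quorum(votes_after_aqc, {"site_b"}))  # True
```

This matches the witness-failure table above: a simultaneous site-plus-witness failure loses quorum, while a witness failure a few minutes after the site failure (post recount) does not.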

Recovery from Complex Failures

In a double site failure, where one data site fails simultaneously with the witness site, data and VMs become unavailable because quorum cannot be achieved. This protection mechanism prevents data from being updated independently in two different locations (a split-brain condition).

There may be a chance to recover the data in the single remaining site when it is known that the other data site and the witness site are not coming back. For all versions up to and including vSAN 8 U3 (VCF 5.2), this involves contacting Global Support (GS) to determine the viability of a potential recovery. Please note that this is a best-effort procedure and does not guarantee the consistency of data inside guest VMs when recovering from stale components.

Additional Information

For vSAN stretched clusters, avoid using a storage policy with locality=none. With such a policy, the components of the same replica can be spread across both data sites in the cluster. This can result in:

  • Undesired issues during reconfiguring tasks of an object such as storage policy changes
  • Issues when placing a host into maintenance mode with ensure accessibility
  • Possibility of objects going inaccessible during planned maintenance
  • Read locality not being guaranteed as reads may go across data sites via the Inter-Site Link (ISL), resulting in latency

When a storage policy's Site disaster tolerance is set to one of the options below, with RAID set to RAID-1/5/6, writes are limited to the site to which the locality is set:

  • Dual site mirroring (stretch cluster)
  • None - keep data on Preferred (stretch cluster)
  • None - keep data on Secondary (stretch cluster)

The issue is specific to stretched cluster storage policies with Site disaster tolerance set to either "None - standard cluster" or "None - stretched cluster" and RAID set to RAID-1/5/6.

For more details, see: