Failure scenario behavior with 4 fault domains, 2 disk groups per fault domain, and RAID5 as failure tolerance method

Article ID: 344873

Products

VMware vSAN

Issue/Introduction

This KB article describes a specific failure scenario when using 4 fault domains, 2 disk groups in each fault domain, and a storage policy with a failure tolerance method of RAID5.

Symptoms:
In this example, each host is its own fault domain (FD), which is the default configuration in vSAN.


When a disk group fails in this scenario, the affected components of RAID5 objects can only be rebuilt on the remaining disk group within the same host / FD.
Depending on how much disk space is in use, the resulting rebuild can fill up the remaining disk group. vSAN 6.7 U3 introduced a mechanism that pauses such a rebuild when space usage on the remaining disk group reaches a configurable fullness threshold. Note that regular VM IO to the remaining disk group is not paused by this mechanism and can still fill up the remaining disk space.
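
The capacity reasoning above can be illustrated with a minimal Python sketch. It is a simplified model and not vSAN code: the DiskGroup class, the simulate_rebuild function, and the 90% threshold used in the example are hypothetical, and the real threshold is whatever value is configured in the cluster.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class DiskGroup:
    capacity_gb: float  # usable capacity of the disk group
    used_gb: float      # space already consumed before the rebuild starts


def simulate_rebuild(remaining: DiskGroup, rebuild_gb: float,
                     threshold_pct: float) -> Tuple[str, float]:
    """Model a rebuild onto the remaining disk group.

    Returns (outcome, gb_rebuilt). The rebuild is modeled as pausing
    once usage would exceed threshold_pct of the capacity. VM IO is
    not modeled; in practice it keeps consuming space on top of this.
    """
    limit_gb = remaining.capacity_gb * threshold_pct / 100.0
    headroom_gb = max(0.0, limit_gb - remaining.used_gb)
    if rebuild_gb <= headroom_gb:
        return "completed", rebuild_gb
    return "paused", headroom_gb


# Hypothetical numbers: a 4000 GB disk group that is 60% full must absorb
# 2000 GB of components from the failed disk group, with a 90% threshold.
outcome, rebuilt = simulate_rebuild(DiskGroup(4000, 2400), 2000, 90)
print(outcome, rebuilt)
```

In this example the rebuild pauses with part of the data still not rebuilt, so the affected objects stay at reduced redundancy until space is freed or capacity is added.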

Environment

VMware vSAN 6.x
VMware vSAN 7.0.x
VMware vSAN 8.0.x

Cause

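In this design, vSAN places the components of a RAID5 object across all 4 fault domains, and a rebuilt component cannot be placed in a fault domain that already holds another component of the same object. vSAN therefore has no option other than to rebuild the affected components on the remaining disk group in the same host / FD.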
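
The placement constraint can be sketched in a few lines of Python. This is an illustrative model only, assuming one RAID5 component per fault domain. The function eligible_rebuild_fds and the names FD1 to FD5 are hypothetical and do not correspond to any vSAN API.

```python
def eligible_rebuild_fds(all_fds, component_fds, failed_fd):
    """Return the fault domains that may host the rebuilt component.

    all_fds:       every fault domain in the cluster
    component_fds: fault domains holding the object's components
                   (assumed: one component per fault domain)
    failed_fd:     fault domain whose component was lost
    """
    # Fault domains that still hold a healthy component of this object
    # are excluded from placement.
    occupied = set(component_fds) - {failed_fd}
    return [fd for fd in all_fds if fd not in occupied]


raid5_layout = ["FD1", "FD2", "FD3", "FD4"]

# 4 fault domains: only the fault domain with the failed disk group is left.
print(eligible_rebuild_fds(["FD1", "FD2", "FD3", "FD4"], raid5_layout, "FD2"))

# 5 fault domains: the additional fault domain is also a valid target.
print(eligible_rebuild_fds(["FD1", "FD2", "FD3", "FD4", "FD5"], raid5_layout, "FD2"))
```

With 4 fault domains the only candidate is the fault domain that lost the disk group, which is why the rebuild lands on its remaining disk group. With a 5th fault domain the rebuild has a second candidate.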

Resolution

There are several ways to handle this situation:
  1. Put the ESXi host that has the failed disk group into maintenance mode as soon as the disk group failure is noticed. This prevents IO on the remaining disk group and therefore prevents it from filling up. Note that this option reduces redundancy, because no rebuild takes place.
  2. Add a 5th host / FD to the cluster to remove the constraint on the rebuild.
  3. If the remaining disk group has already filled up, open a Support Request with VMware Technical Support. For details on how to open a Support Request, see the following KB article: How to file a Support Request in Customer Connect


Additional Information

Release notes for vSAN 6.7 U3, which list the new mechanism of pausing a rebuild/resync when disk usage reaches a configurable fullness threshold: vSAN 6.7 U3 Release Notes

Similar behavior can be experienced when using a vSAN cluster design with the minimum number of fault domains for the storage policy and more than one disk group per fault domain.
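
As a rough way to check whether a design falls into this category, the following Python sketch compares the number of fault domains with the minimum for a policy. The minimums in the table are the commonly documented values for vSAN clusters that use disk groups and are stated here as an assumption to verify against the documentation for the release in use. The function name rebuild_confined_to_same_fd is hypothetical.

```python
# Assumed minimum number of fault domains per (method, failures to tolerate).
MIN_FAULT_DOMAINS = {
    ("RAID1", 1): 3,
    ("RAID1", 2): 5,
    ("RAID1", 3): 7,
    ("RAID5", 1): 4,
    ("RAID6", 2): 6,
}


def rebuild_confined_to_same_fd(method, ftt, fault_domains, disk_groups_per_fd):
    """Return True if a disk group failure can only be rebuilt inside
    the same fault domain, as described in this article."""
    minimum = MIN_FAULT_DOMAINS[(method, ftt)]
    if fault_domains < minimum:
        raise ValueError(
            f"{method} with FTT={ftt} needs at least {minimum} fault domains"
        )
    # At the minimum, every other fault domain already holds a component,
    # so only a second disk group in the same fault domain can take the rebuild.
    return fault_domains == minimum and disk_groups_per_fd > 1


print(rebuild_confined_to_same_fd("RAID5", 1, 4, 2))  # the scenario in this article
print(rebuild_confined_to_same_fd("RAID5", 1, 5, 2))  # after adding a 5th host / FD
```

With a single disk group per fault domain the function returns False, because a disk group failure then leaves no rebuild target in that fault domain at all.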