vSAN RAID-6 Objects Inaccessible Following Abrupt Power Outage and Simultaneous Capacity Disk Failure

Products

VMware vSAN

Issue/Introduction

Symptoms:

A few vSAN objects are reported in an Inaccessible or Unknown state following a data center-wide power outage.
Skyline Health displays an Operational Health Warning.
A capacity disk on one of the hosts is in an Absent state.
Deduplication and Compression is enabled and the entire disk group associated with the failed disk is marked as unhealthy.
The remaining objects in the cluster are resyncing, but specific objects do not progress or recover and remain in an Inaccessible state.
The storage policy in use is RAID-6.

Environment

VMware vSAN 8.x

Cause

The failure is caused by a violation of the RAID-6 (Erasure Coding) quorum requirements. In a vSAN RAID-6 (4+2) configuration, an object is distributed across six components (4 data, 2 parity). This policy allows the object to remain accessible if a maximum of two components are lost.

In this scenario, the combination of two factors led to data unavailability:

Abrupt Power Shutdown: Placed various components into a transient Absent state across multiple hosts.
Permanent Hardware Failure: A capacity disk on one of the hosts failed permanently during the power cycle.

Because Deduplication and Compression is enabled, the loss of a single capacity disk invalidated the entire disk group, so every component stored on that disk group became unavailable. Combined with the components left Absent by the power outage, more than two components were unavailable simultaneously for the impacted objects, exceeding the Failures to Tolerate (FTT=2) threshold.
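The tolerance rule described above can be illustrated with a minimal Python sketch. It is a simplified model and not vSAN's actual availability logic: it only counts unavailable components against the FTT=2 limit and ignores votes and stale components. The function name and the example state lists are hypothetical.

```python
def raid6_object_accessible(component_states, failures_to_tolerate=2):
    """Simplified model: a RAID-6 (4+2) object stays accessible while
    no more than `failures_to_tolerate` components are unavailable.
    Any state other than ACTIVE is treated as unavailable."""
    unavailable = sum(1 for state in component_states if state != "ACTIVE")
    return unavailable <= failures_to_tolerate


# Hypothetical object that lost two components: still within FTT=2.
print(raid6_object_accessible(["ACTIVE"] * 4 + ["ABSENT"] * 2))  # True

# Hypothetical object that lost a third component, for example one on the
# failed disk group plus two left Absent by the outage: FTT=2 is exceeded.
print(raid6_object_accessible(["ACTIVE"] * 3 + ["ABSENT"] * 3))  # False
```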

Cause Validation:

To confirm the state of the components, run the following vSAN command on an ESXi host in the cluster:

esxcli vsan debug object list --all --health=inaccessible

Sample Output:

Object UUID: b6833f65-0028-38e3-da1f-xxxxxxxxxxxx
Version: 15
Health: inaccessible - Lost data availability.(APD)
Owner: xxxxxxx
Size: 0.00 GB
Used: 3.54 GB
Policy:
Configuration:

RAID_6
Component: 265e7e68-d81f-9c6a-13e1-xxxxxxxxxxxx
Component State: ABSENT, Address Space(B): 68451041280 (63.75GB), Disk UUID: 52fab51e-d37c-22ac-6f19-xxxxxxxxxxxx, Disk Name: N/A
Votes: 2, Host UUID: None
Component: b6833f65-fa71-d0e6-9164-xxxxxxxxxxxx
Component State: ACTIVE, Address Space(B): 68451041280 (63.75GB), Disk UUID: 52ce3444-e8a5-7a40-16a1-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 742391808 (0.69GB), Physical Capacity Used(B): 734003200 (0.68GB), Host Name: xxxxxxxxxxxx
Component: b6833f65-3c2a-d5e6-ab8c-xxxxxxxxxxxx
Component State: ABSENT, CSN: STALE (981!=982), Address Space(B): 68451041280 (63.75GB), Disk UUID: 52d160bf-22ad-62d8-d3e7-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 792723456 (0.74GB), Physical Capacity Used(B): 784334848 (0.73GB), Host Name: xxxxxxxxxxxx
RAID_D
Component: b6833f65-30b3-d9e6-c265-xxxxxxxxxxxx
Component State: ABSENT, CSN: STALE (976!=982), Address Space(B): 68451041280 (63.75GB), Disk UUID: 528a93b2-955c-19b3-3d61-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 721420288 (0.67GB), Physical Capacity Used(B): 713031680 (0.66GB), Host Name: xxxxxxxxxxxx
Component: 7c02d769-ced8-68d8-008e-xxxxxxxxxxxx
Component State: ABSENT, CSN: STALE (978!=982), Address Space(B): 68451041280 (63.75GB), Disk UUID: 5225b0e2-d929-11ea-a2ab-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 25165824 (0.02GB), Physical Capacity Used(B): 20971520 (0.02GB), Host Name: xxxxxxxxxxxx
Component: b6833f65-48a4-dde6-90c3-xxxxxxxxxxxx
Component State: ACTIVE, Address Space(B): 68451041280 (63.75GB), Disk UUID: 52abf5be-ba18-800e-63db-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 801112064 (0.75GB), Physical Capacity Used(B): 792723456 (0.74GB), Host Name: xxxxxxxxxxxx
Component: d77a7565-22b0-5b2b-8644-xxxxxxxxxxxx
Component State: ACTIVE, Address Space(B): 68451041280 (63.75GB), Disk UUID: 5298aabe-25b8-d39f-725d-xxxxxxxxxxxx, Disk Name: naa.#############:2
Votes: 1, Capacity Used(B): 759169024 (0.71GB), Physical Capacity Used(B): 750780416 (0.70GB), Host Name: xxxxxxxxxxxx
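When many objects are affected, the component states can be tallied from saved command output rather than read by eye. The following Python sketch assumes the output was captured to text and that each component line follows the "Component State: ..." layout shown in the sample above. The function name and the trimmed excerpt are illustrative only.

```python
import re
from collections import Counter

# Matches the state keyword that follows "Component State:" in the output.
STATE_RE = re.compile(r"Component State:\s*([A-Z_]+)")


def summarize_components(output_text):
    """Count components per state and count those flagged 'CSN: STALE'."""
    states = Counter()
    stale = 0
    for line in output_text.splitlines():
        match = STATE_RE.search(line)
        if match:
            states[match.group(1)] += 1
            if "CSN: STALE" in line:
                stale += 1
    return dict(states), stale


# Trimmed excerpt in the same layout as the sample output.
excerpt = """\
Component State: ABSENT, Address Space(B): 68451041280 (63.75GB)
Component State: ACTIVE, Address Space(B): 68451041280 (63.75GB)
Component State: ABSENT, CSN: STALE (981!=982), Address Space(B): 68451041280 (63.75GB)
"""
states, stale = summarize_components(excerpt)
print(states, stale)
```

An object that shows more than two components in a state other than ACTIVE matches the condition described in the Cause section.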

Resolution

To address the inaccessible objects and the disk failure, perform the following steps.

1. Restore from Backup

Since more than two components of the RAID-6 stripe are missing or permanently lost due to the disk group failure, the data for these objects is mathematically incomplete and cannot be recovered by the vSAN layer.

Identify the Virtual Machines associated with the inaccessible Object IDs.
Initiate a Restore from Backup for the impacted VMs.
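As a starting point for identifying the impacted VMs, the Object UUIDs of the inaccessible objects can be extracted from saved command output. This Python sketch assumes the "Object UUID:", "Health:" and "Owner:" lines appear as in the sample output above. The function name and the excerpt values are hypothetical, and mapping each UUID to a VM still has to be done separately, for example in the vSAN Virtual Objects view of the vSphere Client.

```python
import re

# Only the per-object header fields are captured; component lines are skipped.
FIELD_RE = re.compile(r"^\s*(Object UUID|Health|Owner):\s*(.*)$")


def list_objects(output_text):
    """Return one dict per object with its uuid, health and owner fields."""
    objects = []
    for line in output_text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "Object UUID":
            objects.append({"uuid": value})
        elif objects:
            objects[-1][key.lower()] = value
    return objects


# Excerpt with placeholder values in the same layout as the sample output.
excerpt = """\
Object UUID: 11111111-0000-0000-0000-aaaaaaaaaaaa
Health: inaccessible - Lost data availability.(APD)
Owner: host-a
"""
for obj in list_objects(excerpt):
    if obj.get("health", "").startswith("inaccessible"):
        print(obj["uuid"])
```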

2. Hardware Remediation

Place the host with the failed capacity disk in maintenance mode using the Ensure accessibility option.
Delete the unhealthy disk group.
Replace the failed capacity disk.
Recreate the disk group once the hardware is healthy to restore the cluster to full capacity and redundancy.
Monitor the vSAN Resyncing Objects dashboard to ensure all other data has finished resyncing.