vSAN RAID-6 Objects Inaccessible Following Abrupt Power Outage and Simultaneous Capacity Disk Failure

Article ID: 437376

Updated On:

Products

VMware vSAN

Issue/Introduction

Symptoms:

 

  • A few vSAN objects are reported in an Inaccessible or Unknown state following a data center-wide power outage.

  • Skyline Health displays an Operational Health Warning.

  • A capacity disk on one of the hosts is in an Absent state.

  • Deduplication and Compression is enabled, and the entire disk group associated with the failed disk is marked as unhealthy.

  • The remaining objects in the cluster are resyncing, but the affected objects make no progress, do not recover, and stay in an Inaccessible state.

  • The storage policy in use is RAID-6.

 

Environment

VMware vSAN 8.x

Cause

The objects became inaccessible because more components were lost than RAID-6 (erasure coding) can tolerate. In a vSAN RAID-6 (4+2) configuration, each object is distributed across six components (four data, two parity). This policy keeps the object accessible as long as no more than two of those components are unavailable.

In this scenario, the combination of two factors led to data unavailability:

  1. Abrupt Power Shutdown: Placed various components into a transient Absent state across multiple hosts.

  2. Permanent Hardware Failure: A capacity disk on one of the hosts failed permanently during the power cycle.

Because deduplication is enabled, the loss of a single capacity disk invalidated the entire disk group. This resulted in more than two components becoming unavailable simultaneously for the impacted objects, exceeding the Failures to Tolerate (FTT=2) threshold.
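The tolerance rule described above can be written out as a small check. This is a minimal sketch of the arithmetic only, not of how vSAN evaluates availability internally: the function name and parameters are hypothetical, and the model deliberately ignores vote-based quorum and stale components.

```python
def raid6_object_accessible(unavailable: int, data: int = 4, parity: int = 2) -> bool:
    """Simplified model: a RAID-6 (4+2) object stays readable while the
    number of unavailable components does not exceed the parity count.

    Assumption: every component counts equally; votes are not modeled.
    """
    total = data + parity
    if not 0 <= unavailable <= total:
        raise ValueError(f"unavailable must be between 0 and {total}")
    return unavailable <= parity


if __name__ == "__main__":
    # Two components lost: still within FTT=2.
    print(raid6_object_accessible(2))
    # Three components lost (power outage plus failed disk group): exceeds FTT=2.
    print(raid6_object_accessible(3))
```

With the defaults, two unavailable components still pass the check, while a third one (for example, a transient Absent component on top of a failed disk group) does not.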

Cause Validation:

To confirm the state of the components, run the following command on an ESXi host in the cluster:

esxcli vsan debug object list --all --health=inaccessible

Sample Output:

Object UUID: b6833f65-0028-38e3-da1f-xxxxxxxxxxxx
   Version: 15
   Health: inaccessible - Lost data availability.(APD)
   Owner: xxxxxxx
   Size: 0.00 GB
   Used: 3.54 GB
   Policy:
   Configuration:

      RAID_6
         Component: 265e7e68-d81f-9c6a-13e1-xxxxxxxxxxxx
           Component State: ABSENT,  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 52fab51e-d37c-22ac-6f19-xxxxxxxxxxxx,  Disk Name: N/A
           Votes: 2,  Host UUID: None
         Component: b6833f65-fa71-d0e6-9164-xxxxxxxxxxxx
           Component State: ACTIVE,  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 52ce3444-e8a5-7a40-16a1-xxxxxxxxxxxx,  Disk Name: naa.#############:2
           Votes: 1,  Capacity Used(B): 742391808 (0.69GB),  Physical Capacity Used(B): 734003200 (0.68GB),  Host Name: xxxxxxxxxxxx
         Component: b6833f65-3c2a-d5e6-ab8c-xxxxxxxxxxxx
           Component State: ABSENT,  CSN: STALE (981!=982),  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 52d160bf-22ad-62d8-d3e7-xxxxxxxxxxxx,  Disk Name: naa.#############:2
           Votes: 1,  Capacity Used(B): 792723456 (0.74GB),  Physical Capacity Used(B): 784334848 (0.73GB),  Host Name: xxxxxxxxxxxx
         RAID_D
            Component: b6833f65-30b3-d9e6-c265-xxxxxxxxxxxx
              Component State: ABSENT,  CSN: STALE (976!=982),  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 528a93b2-955c-19b3-3d61-xxxxxxxxxxxx,  Disk Name: naa.#############:2
              Votes: 1,  Capacity Used(B): 721420288 (0.67GB),  Physical Capacity Used(B): 713031680 (0.66GB),  Host Name: xxxxxxxxxxxx
            Component: 7c02d769-ced8-68d8-008e-xxxxxxxxxxxx
              Component State: ABSENT,  CSN: STALE (978!=982),  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 5225b0e2-d929-11ea-a2ab-xxxxxxxxxxxx,  Disk Name: naa.#############:2
              Votes: 1,  Capacity Used(B): 25165824 (0.02GB),  Physical Capacity Used(B): 20971520 (0.02GB),  Host Name: xxxxxxxxxxxx
         Component: b6833f65-48a4-dde6-90c3-xxxxxxxxxxxx
           Component State: ACTIVE,  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 52abf5be-ba18-800e-63db-xxxxxxxxxxxx,  Disk Name: naa.#############:2
           Votes: 1,  Capacity Used(B): 801112064 (0.75GB),  Physical Capacity Used(B): 792723456 (0.74GB),  Host Name: xxxxxxxxxxxx
         Component: d77a7565-22b0-5b2b-8644-xxxxxxxxxxxx
           Component State: ACTIVE,  Address Space(B): 68451041280 (63.75GB),  Disk UUID: 5298aabe-25b8-d39f-725d-xxxxxxxxxxxx,  Disk Name: naa.#############:2
           Votes: 1,  Capacity Used(B): 759169024 (0.71GB),  Physical Capacity Used(B): 750780416 (0.70GB),  Host Name: xxxxxxxxxxxx
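When many objects are listed, counting component states by hand is error-prone. The following sketch tallies the "Component State" and "CSN: STALE" fields from output that has been saved or pasted as text. The function names are hypothetical, and the parsing assumes the line layout shown in the sample above. It counts individual components, so components nested under a group such as RAID_D are each counted separately.

```python
import re
from collections import Counter

# Matches the state token in lines such as
# "Component State: ABSENT,  CSN: STALE (981!=982), ..."
STATE_RE = re.compile(r"Component State:\s*([A-Z_]+)")
STALE_RE = re.compile(r"CSN:\s*STALE")


def tally_component_states(output: str) -> Counter:
    """Count how many components report each state in the given text."""
    return Counter(STATE_RE.findall(output))


def count_stale_components(output: str) -> int:
    """Count components whose CSN field is flagged as STALE."""
    return len(STALE_RE.findall(output))


if __name__ == "__main__":
    # Abbreviated lines in the same shape as the sample output.
    sample = """
    Component State: ABSENT,  Address Space(B): 68451041280 (63.75GB)
    Component State: ACTIVE,  Address Space(B): 68451041280 (63.75GB)
    Component State: ABSENT,  CSN: STALE (981!=982)
    """
    print(tally_component_states(sample))
    print(count_stale_components(sample))
```

Applied to the full sample output above, the tally is four ABSENT and three ACTIVE components, with three of the ABSENT components also marked stale.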

Resolution

To address the inaccessible objects and the disk failure, perform the following steps.

1. Restore from Backup

Since more than two components of the RAID-6 stripe are missing or permanently lost due to the disk group failure, the data for these objects is mathematically incomplete and cannot be recovered by the vSAN layer.

  • Identify the Virtual Machines associated with the inaccessible Object IDs.

  • Initiate a Restore from Backup for the impacted VMs.
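To build the list of object IDs to match against virtual machines, the UUIDs of inaccessible objects can be pulled out of saved command output. This is a rough sketch that assumes each object block starts with an "Object UUID:" line followed by a "Health:" line, as in the sample output. The function name is hypothetical, the second UUID and the "healthy" value in the demo text are made up for illustration, and mapping a UUID to its VM is not covered here.

```python
from typing import List


def inaccessible_object_uuids(output: str) -> List[str]:
    """Return the UUIDs of objects whose Health line starts with 'inaccessible'."""
    uuids: List[str] = []
    current = None  # UUID of the object block currently being read
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Object UUID:"):
            current = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Health:") and current is not None:
            health = stripped.split(":", 1)[1].strip()
            if health.startswith("inaccessible"):
                uuids.append(current)
            current = None  # wait for the next object block
    return uuids


if __name__ == "__main__":
    # Demo text: the first block mirrors the sample output,
    # the second block is a made-up healthy object.
    demo = """
Object UUID: b6833f65-0028-38e3-da1f-xxxxxxxxxxxx
   Version: 15
   Health: inaccessible - Lost data availability.(APD)
Object UUID: 00000000-0000-0000-0000-xxxxxxxxxxxx
   Version: 15
   Health: healthy
"""
    for uuid in inaccessible_object_uuids(demo):
        print(uuid)
```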

2. Hardware Remediation

  • Place the host with the failed capacity disk into maintenance mode using the Ensure accessibility option.

  • Delete the unhealthy disk group.

  • Replace the failed capacity disk.

  • Recreate the disk group once the hardware is healthy to restore the cluster to full capacity and redundancy.

  • Monitor the vSAN Resyncing Objects dashboard to ensure all other data has finished resyncing.