vSAN Disk Group Offline Due to NVMe Unrecoverable Read Error 0x281

Products

VMware vSAN

Issue/Introduction

Symptoms:

A vSAN cluster using deduplication and compression experiences a disk group failure. The host's out-of-band hardware management interface (e.g., iLO, iDRAC) does not report any physical hardware faults.
However, the ESXi host logs show physical media read failures followed by vSAN software fault propagation.
Excerpts from vmkernel.log show NVMe read commands (opc 0x2) failing with status 0x281:

WARNING: NVMEIO:2645 command ##### failed: ctlr 256, queue 1, psaCmd ####, status 0x281, opc 0x2, cid 1160, nsid 1
WARNING: NVMEPSA:217 Complete vmkNvmeCmd: ######, vmkPsaCmd: ######, cmdId.initiator=#######, CmdSN: 0x358a3, status: 0x281
WARNING: HPP: HppNvmeThrottleLogForDevice:600: NVMe Cmd 0x2 (########, 0) to dev "#######" on path "vmhba2:C0:T0:L0" Failed:
WARNING: HPP: HppNvmeThrottleLogForDevice:608: Error status H:0x0 D:0x281 P:0x0 hppAction = 1
WARNING: NvmeUtil: 151: Error on Cmd(########) 0x2, CmdSN 0x358a3 from world 0 to component "#########" H:0x0 D:0x281 P:0x0
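
To triage a large vmkernel.log, the warnings above can be counted per status value. The following is a minimal Python sketch, not a VMware tool: the function name count_nvme_statuses and the regular expression are illustrative assumptions based only on the line formats shown in the excerpts above, and other ESXi builds may format these messages differently.

```python
import re
from collections import Counter

# Assumption: the device status appears either as "status 0x...",
# "status: 0x..." or as the "D:0x..." field, as in the excerpts above.
STATUS_RE = re.compile(r"(?:\bstatus:?\s+|\bD:)(0x[0-9a-fA-F]+)")

def count_nvme_statuses(lines):
    """Count WARNING lines per NVMe device status value (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if "WARNING" not in line:
            continue
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts

# Sample lines copied from the excerpts above (identifiers are masked).
SAMPLE = '''
WARNING: NVMEIO:2645 command ##### failed: ctlr 256, queue 1, psaCmd ####, status 0x281, opc 0x2, cid 1160, nsid 1
WARNING: HPP: HppNvmeThrottleLogForDevice:600: NVMe Cmd 0x2 (########, 0) to dev "#######" on path "vmhba2:C0:T0:L0" Failed:
WARNING: HPP: HppNvmeThrottleLogForDevice:608: Error status H:0x0 D:0x281 P:0x0 hppAction = 1
'''.splitlines()

if __name__ == "__main__":
    for status, hits in count_nvme_statuses(SAMPLE).items():
        print(f"status {status:#x}: {hits} warning line(s)")
```

A high count for a single status on a single device path points at that device rather than at the fabric or the controller, which is consistent with the single-drive failure described in this article.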

The vsandevicemonitor.log excerpts show Latent Sector Error (LSE) detection and evacuation failure:

In(14) vsandevicemonitord[2101934]: [238983484032]: Device ##### state is DG_PROPAGATED_UNHEALTHY_BY_LSE
In(14) vsandevicemonitord[2101934]: [238983484032]: Device ##### state is DISK_UNHEALTHY_BY_LSE
In(14) vsandevicemonitord[2101934]: [238983484032]: URE detected on: Dev ###### uuid <#####> Health 8192
In(14) vsandevicemonitord[2101934]: [239072446208]: Rebuilding the diskgroup ##### with evacReason Ure
In(14) vsandevicemonitord[2101934]: [238983484032]: Cannot auto remediate disk ###### for reason Ure, a remediation is already in progress on this host.
In(14) vsandevicemonitord[2101934]: [239072446208]: Evacuation failed with failure reason 13, for diskgroup ######, evacReason Ure
In(14)[+] vsandevicemonitord[2101934]: Unexpected error happened during rebuild disk group. Failed to evacuate data for disk uuid ###### with error: Busy, failure reason: 13
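
The LSE state changes and the failed evacuation can be extracted from the same log with a short script. This is a hedged sketch in Python: summarize_monitor_log is a hypothetical helper, and its patterns assume the message wording shown in the excerpts above.

```python
import re

# Assumption: message wording matches the excerpts in this article.
STATE_RE = re.compile(r"state is (\w+)")
EVAC_FAIL_RE = re.compile(r"Evacuation failed with failure reason (\d+)")

def summarize_monitor_log(lines):
    """Collect LSE-related device states and evacuation failure reasons."""
    summary = {"lse_states": [], "evac_failure_reasons": []}
    for line in lines:
        if "vsandevicemonitord" not in line:
            continue
        state = STATE_RE.search(line)
        if state and state.group(1).endswith("_BY_LSE"):
            summary["lse_states"].append(state.group(1))
        failed = EVAC_FAIL_RE.search(line)
        if failed:
            summary["evac_failure_reasons"].append(int(failed.group(1)))
    return summary

# Sample lines copied from the excerpts above (identifiers are masked).
SAMPLE_MONITOR = [
    "In(14) vsandevicemonitord[2101934]: [238983484032]: Device ##### state is DG_PROPAGATED_UNHEALTHY_BY_LSE",
    "In(14) vsandevicemonitord[2101934]: [238983484032]: Device ##### state is DISK_UNHEALTHY_BY_LSE",
    "In(14) vsandevicemonitord[2101934]: [239072446208]: Evacuation failed with failure reason 13, for diskgroup ######, evacReason Ure",
]

if __name__ == "__main__":
    print(summarize_monitor_log(SAMPLE_MONITOR))
```

Seeing both a DISK_UNHEALTHY_BY_LSE state and an evacuation failure for the same disk group matches the symptom pattern described here.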

A single NVMe capacity drive encounters an Unrecoverable Read Error (URE), causing the entire disk group to be marked offline to prevent data corruption.

Environment

vSAN 8.x

Cause

The disk group failure is caused by a physical media failure, reported as an Unrecoverable Read Error (URE) / Latent Sector Error (LSE), on a single NVMe capacity drive. Per the NVMe specification, status 0x281 decodes to Status Code Type 0x2 (Media and Data Integrity Errors) with Status Code 0x81 (Unrecovered Read Error), which confirms a failure of the physical media.
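
The decoding of 0x281 can be reproduced with a few lines of Python. This is a minimal sketch under one stated assumption: the value printed by ESXi packs the Status Code Type (SCT) in bits 10:8 and the Status Code (SC) in bits 7:0. The function name decode_nvme_status is illustrative, and the tables list only a subset of the codes defined in the NVMe base specification.

```python
# Status Code Type (SCT) names from the NVMe base specification.
SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
    0x7: "Vendor Specific",
}

# Subset of Status Codes (SC) defined for SCT 0x2.
MEDIA_ERROR_NAMES = {
    0x80: "Write Fault",
    0x81: "Unrecovered Read Error",
    0x82: "End-to-end Guard Check Error",
    0x83: "End-to-end Application Tag Check Error",
    0x84: "End-to-end Reference Tag Check Error",
    0x85: "Compare Failure",
    0x86: "Access Denied",
    0x87: "Deallocated or Unwritten Logical Block",
}

def decode_nvme_status(status):
    """Split a logged status value into SCT and SC (assumed bit layout)."""
    sct = (status >> 8) & 0x7   # bits 10:8
    sc = status & 0xFF          # bits 7:0
    sct_name = SCT_NAMES.get(sct, "Reserved or unknown")
    # Only SCT 0x2 codes are named here; other types are left undecoded.
    sc_name = MEDIA_ERROR_NAMES.get(sc, "Unknown") if sct == 0x2 else "Not decoded"
    return {"sct": sct, "sc": sc, "sct_name": sct_name, "sc_name": sc_name}

if __name__ == "__main__":
    print(decode_nvme_status(0x281))
```

For 0x281 this yields SCT 0x2 and SC 0x81, the Unrecovered Read Error described above.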

Because a vSAN disk group with deduplication and compression enabled shares a single hash domain across all of its capacity drives, the failure of one capacity drive forces vSAN to take the whole disk group offline to preserve data integrity.

The automated disk group rebuild (evacReason Ure) fails with failure reason 13 ("Busy") because the faulty physical medium does not respond to the read operations required for data evacuation.

Resolution

Replace the faulty disk in the affected disk group:

1. Place the ESXi host into Maintenance Mode using the Ensure Accessibility option.
2. Delete the affected vSAN disk group in the vCenter Server UI. This removes the corrupted deduplication hash domain and stops the failing automated remediation loop.
3. Physically replace the faulty NVMe drive.
4. Recreate the vSAN disk group using the replacement capacity drive, the original cache tier drive, and the remaining healthy capacity drives to restore cluster storage policy compliance.

Additional Information

NVMe OpCodes and Status Definitions