Single failed NVMe disk can cause entire vSAN cluster to fault with I/O errors

Article ID: 324271


Updated On:

Products

VMware vSAN

Issue/Introduction

  • Potential APD on all ESXi hosts and/or VM restarts due to lost VMware Tools heartbeats.
  • VMs reboot due to storage APD causing VMware Tools heartbeat timeouts.
  • A subset of VMs experience storage APDs due to stuck I/O.
  • In rare scenarios, a race condition may result in a PSOD of the host.
  • vSAN stuck-descriptor messages and/or a STUCK_IO event may be found in the vmkernel log:
2023-06-22T08:26:30.335Z cpu1:3668325)ScsiDeviceIO: 12527: Task mgmt request issued to device t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C is stuck (WorldID 0, Cmd 0x28, CmdSN 3334). Issuing red notification to the application

2023-06-22T08:26:30.335Z cpu1:3668325)ScsiDeviceIO: 12559: FDS_DEV_EVENT_REPORT_STUCK_IO event for device t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C
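
If this symptom is suspected, the vmkernel log can be checked for these events directly. A minimal check, assuming the standard ESXi log location:

    grep -E "is stuck|FDS_DEV_EVENT_REPORT_STUCK_IO" /var/log/vmkernel.log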
  • I/O errors may also appear in the logs:
2023-06-22T08:24:38.862Z cpu65:2195482)WARNING: NVMEIO:3581 Ctlr 262, nvmeCmd 0x45de216ec600 (opc 02), queue 1 (expect 65535) not available, nvmeStatus 80e
2023-06-22T08:24:38.862Z cpu65:2098248)WARNING: NVMEPSA:203 Complete vmkNvmeCmd: 0x45de216ec600, vmkPsaCmd: 0x45be801dac48, cmdId.initiator=0x430a24755010, CmdSN: 0x1970ee, status: 0x80e
2023-06-22T08:24:38.862Z cpu20:2100573 opID=adc801ff)Partition: 437: Failed read for "t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C": I/O error
2023-06-22T08:24:38.862Z cpu20:2100573 opID=adc801ff)Partition: 1123: Failed to read protective mbr on "t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C" : I/O error
2023-06-22T08:24:38.862Z cpu20:2100573 opID=adc801ff)WARNING: Partition: 1289: Partition table read from device t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C failed: I/O error
2023-06-22T08:24:38.862Z cpu20:2100573 opID=adc801ff)WARNING: NVMEIO:3581 Ctlr 262, nvmeCmd 0x45be6f64c5c0 (opc 02), queue 1 (expect 65535) not available, nvmeStatus 80e
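
The device identifier in these messages can be matched to a physical NVMe disk on the host. A minimal sketch, assuming the esxcli nvme namespace is available on the host; the t10 identifier below is the example device from this article:

    # List NVMe devices recognized by the host
    esxcli nvme device list

    # Show details and current status of the suspect device
    esxcli storage core device list -d t10.NVMe____Dell_Express_Flash_NVMe_P4510_4TB_SFF___00014F5CA7E4D25C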
  • In some cases, congestion may also be reported by the host:


2023-06-22T09:09:27.624Z: [vSANCorrelator] 789244634927us: [vob.vsan.lsom.congestionthreshold] LSOM SSDCong in 525b3aef-80ee-3579-ccd3-6a08508ab925 Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 243.
2023-06-22T09:09:27.624Z: [vSANCorrelator] 789249074685us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSDCong in 525b3aef-80ee-3579-ccd3-6a08508ab925 Congestion State: Exceeded. Congestion Threshold: 200 Current Congestion: 243.
2023-06-22T09:10:27.634Z: [vSANCorrelator] 789304644243us: [vob.vsan.lsom.congestionthreshold] LSOM SSDCong in 525b3aef-80ee-3579-ccd3-6a08508ab925 Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0.
2023-06-22T09:10:27.634Z: [vSANCorrelator] 789309084437us: [esx.problem.vsan.lsom.congestionthreshold] LSOM SSDCong in 525b3aef-80ee-3579-ccd3-6a08508ab925 Congestion State: Normal. Congestion Threshold: 200 Current Congestion: 0
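
Whether congestion or disk issues are currently being flagged can also be checked from the host itself. A minimal check, assuming a release where the esxcli vsan health namespace is available:

    # Summarized vSAN health check results from this host's perspective
    esxcli vsan health cluster list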
  • The vobd log may also report PDL events:

2023-06-22T08:24:02.921Z: [vSANCorrelator] 786519949736us: [vob.vsan.pdl.offline] vSAN device 528e93c3-4e51-93ad-c5e3-8a4d7eb2a60c has gone offline.
2023-06-22T08:24:02.922Z: [vSANCorrelator] 786524371797us: [esx.problem.vob.vsan.pdl.offline] vSAN device 528e93c3-4e51-93ad-c5e3-8a4d7eb2a60c has gone offline.
2023-06-22T08:24:02.922Z: [vSANCorrelator] 786519949748us: [vob.vsan.pdl.offline] vSAN device 525f86bd-592d-631b-3bb9-1399d4509366 has gone offline.
2023-06-22T08:24:02.922Z: [vSANCorrelator] 786524371895us: [esx.problem.vob.vsan.pdl.offline] vSAN device 525f86bd-592d-631b-3bb9-1399d4509366 has gone offline.
2023-06-22T08:24:02.922Z: [vSANCorrelator] 786519949774us: [vob.vsan.pdl.offline] vSAN device 5259a7a1-763c-4101-38cb-da53332fa649 has gone offline.
2023-06-22T08:24:02.922Z: [vSANCorrelator] 786524371939us: [esx.problem.vob.vsan.pdl.offline] vSAN device 5259a7a1-763c-4101-38cb-da53332fa649 has gone offline.
2023-06-22T08:24:04.915Z: [vSANCorrelator] 786521943100us: [vob.vsan.pdl.offline] vSAN device 5272ae62-10a6-6e84-f3ee-498673d9c5e8 has gone offline.
2023-06-22T08:24:04.915Z: [vSANCorrelator] 786526364980us: [esx.problem.vob.vsan.pdl.offline] vSAN device 5272ae62-10a6-6e84
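
The UUIDs in these PDL events are vSAN disk UUIDs rather than device names. They can be mapped back to the failed physical disk with the vSAN storage listing. A minimal sketch, using the first UUID from the excerpt above; if the host's grep does not support context flags, review the full output instead:

    # Print all vSAN-claimed devices and locate the offline UUID
    esxcli vsan storage list | grep -B 2 -A 8 528e93c3-4e51-93ad-c5e3-8a4d7eb2a60c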



Environment

VMware vSAN

Cause

Quiesce operations cannot complete while an I/O operation is still pending to the device. In some cases of stuck I/O, the device neither acknowledges completion of the I/O operation (ACK) nor honors the ABORT request coming from the driver. This can cause a race condition.
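
One way to observe this behavior is to check whether abort and task management requests are being issued repeatedly without the I/O ever completing. A minimal check, assuming the standard log location:

    grep -iE "abort|task mgmt" /var/log/vmkernel.log | tail -n 20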

Resolution

This issue may manifest in ESXi 7.0.3 P07 and 8.0.x. It is fixed in ESXi 7.0.3 P09 and ESXi 8.0.3 (8.0 U3).
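
To confirm whether a host already carries the fix, compare its running version and patch level against the versions above. For example:

    # Reports product, version, build, update and patch level
    esxcli system version get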

Workaround:

  1. Power off all the virtual machines on the host if vMotion of the VMs is not possible.
  2. Put the impacted host into maintenance mode with "No Data Migration", as "Ensure Accessibility" will most likely fail.
  3. If entering maintenance mode from the UI does not work, put the node into maintenance mode from the CLI (see the sketch after this list) with:
    localcli system maintenanceMode set -e true -m noAction
  4. Once the node is in maintenance mode, a shutdown is preferred over a reboot to help with re-initialization of the physical devices.
  5. Power the node back up and attempt to vMotion VMs to other nodes.
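
The workaround can also be driven entirely from the ESXi shell. A minimal sketch of steps 1 through 4, assuming SSH access to the affected host; the world ID and shutdown reason are placeholders:

    # Step 1: list running VMs and power them off (soft) if vMotion is not possible
    esxcli vm process list
    esxcli vm process kill --type=soft --world-id=<world-id>

    # Steps 2/3: enter maintenance mode with No Action (no data migration)
    localcli system maintenanceMode set -e true -m noAction

    # Step 4: shut down (preferred over reboot) so the physical devices re-initialize
    esxcli system shutdown poweroff --reason="Stuck NVMe I/O workaround"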

 

Additional Information