Repeated All-Paths-Down (APD) errors on Micron NVMe drives

Products

VMware vSphere ESXi
VMware vSphere ESXi 8.0

Issue/Introduction

Symptoms:

You have NVMe drives installed locally in the ESXi host.
The host reports an "All-Paths-Down" (APD) state for the NVMe drive(s) in vobd.log:

2025-09-25T21:27:01.197Z In(14) vobd[2099683]: [vmfsCorrelator] 6927271523835us: [vob.vmfs.heartbeat.timedout] 68681008-ed818838-caad-1423f2c0b230 LOCAL_DATASTORE01
2025-09-25T21:27:01.197Z In(14) vobd[2099683]: [vmfsCorrelator] 6927390774021us: [esx.problem.vmfs.heartbeat.timedout] 68681008-ed818838-caad-1423f2c0b230 LOCAL_DATASTORE01
2025-09-25T21:27:41.583Z In(14) vobd[2099683]: [APDCorrelator] 6927311909286us: [vob.storage.apd.start] Device or filesystem with identifier [t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY] has entered the All Paths Down state.
2025-09-25T21:27:41.583Z In(14) vobd[2099683]: [psastorCorrelator] 6927311909218us: [vob.psastor.psastorpath.pathstate.dead] storagePath vmhba1:C0:T0:L0 changed state from on (device ID: t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY)
2025-09-25T21:27:41.583Z In(14) vobd[2099683]: [APDCorrelator] 6927431159954us: [esx.problem.storage.apd.start] Device or filesystem with identifier [t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY] has entered the All Paths Down state.
2025-09-25T21:27:41.592Z In(14) vobd[2099683]: [psastorCorrelator] 6927431168453us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY. Path vmhba1:C0:T0:L0 is down. Affected datastores: "LOCAL_DATASTORE01".
2025-09-25T21:30:01.587Z In(14) vobd[2099683]: [APDCorrelator] 6927451910165us: [vob.storage.apd.timeout] Device or filesystem with identifier [t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
2025-09-25T21:30:01.587Z In(14) vobd[2099683]: [APDCorrelator] 6927571163551us: [esx.problem.storage.apd.timeout] Device or filesystem with identifier [t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY] has entered the All Paths Down Timeout state after being in the All Paths Down state for 140 seconds. I/Os will now be fast failed.
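
The 140-second interval reported in the timeout message can be cross-checked against the timestamps of the "vob.storage.apd.start" and "vob.storage.apd.timeout" events. The following is a minimal Python sketch, not a VMware-provided tool. The function name apd_durations and the regular expression are assumptions based only on the vobd.log line format shown above, and the script is meant to be run against a copy of the log.

```python
import re
from datetime import datetime

# Pattern assumed from the vobd.log samples above: ISO timestamp, then the
# "vob.storage.apd.*" event name in brackets, then the device identifier.
# The duplicate "esx.problem.storage.apd.*" lines are deliberately not matched.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z .*"
    r"\[vob\.storage\.apd\.(?P<event>start|timeout)\] "
    r"Device or filesystem with identifier \[(?P<dev>[^\]]+)\]"
)


def apd_durations(lines):
    """Return (device, seconds) for each APD start followed by an APD timeout."""
    started = {}   # device identifier -> timestamp of the last APD start
    results = []
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        dev = m.group("dev")
        if m.group("event") == "start":
            started[dev] = ts
        elif dev in started:
            results.append((dev, (ts - started.pop(dev)).total_seconds()))
    return results
```

For the sample events above (start at 21:27:41.583, timeout at 21:30:01.587), the computed interval is about 140 seconds, which matches the value in the timeout message.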

You see entries in vmkernel.log similar to the following, showing a 0x808 status:

2025-09-25T20:58:13.167Z Wa(180) vmkwarning: cpu298:2100026)WARNING: HPP: HppNvmeThrottleLogForDevice:600: NVMe Cmd 0x2 (0x45db81be0cc0, 4358304) to dev "t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY" on path "vmhba1:C0:T0:L0" Failed:

2025-09-25T20:58:13.167Z Wa(180) vmkwarning: cpu298:2100026)WARNING: HPP: HppNvmeThrottleLogForDevice:608: Error status H:0x9 D:0x0 P:0x0 hppAction = 1

2025-09-25T20:58:13.167Z Wa(180) vmkwarning: cpu298:2100026)WARNING: NVMEPSA:217 Complete vmkNvmeCmd: 0x45bb8f7a90c0, vmkPsaCmd: 0x45db81be0cc0, cmdId.initiator=0x4311c8c17d80, CmdSN: 0x120, status: 0x808
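
To gauge how often the 0x808 status occurs, the NVMEPSA completion lines can be tallied by status code. This is a minimal Python sketch under stated assumptions: the function name count_completion_statuses is hypothetical, and the regular expression is derived only from the single NVMEPSA sample line above, so it may need adjusting for other builds or message variants.

```python
import re
from collections import Counter

# Assumed from the NVMEPSA sample line above: the completion status is the
# "status: 0x..." field on a "Complete vmkNvmeCmd" line.
STATUS_RE = re.compile(
    r"NVMEPSA:\d+ Complete vmkNvmeCmd: .*status: (0x[0-9a-fA-F]+)"
)


def count_completion_statuses(lines):
    """Count NVMEPSA command completions per status code (lower-cased hex)."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1).lower()] += 1
    return counts
```

Passing the lines of a copied vmkernel.log to the function returns a Counter, so counts["0x808"] gives the number of completions with that status.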

You may also see virtual resets ("Virt reset") being issued to the NVMe device and commands being aborted in conjunction with the above errors:

2025-09-25T21:27:01.197Z In(182) vmkernel: cpu3:2100134)NvmeDeviceIO: 3315: Virt reset issued on device t10.NVMe____XXXXXXXXXXXXXXXXXXXXXXXXX________________YYYYYYYYYYYYYYYYY

2025-09-25T21:27:01.197Z In(182) vmkernel: cpu103:2099594)NVMEPSA:1345 taskMgmt:abort cmdId.initiator=0x430c72ac9f00 CmdSN 0x557fe25 world:2097544 controller 261 state:5 nsid:1

2025-09-25T21:27:01.197Z In(182) vmkernel: cpu103:2099594)NVMEIO:3974 Ctlr 261, ns 1, tmReq 0x431f44f2c320, type 1, initiator 0x430c72ac9f00, sn 0x557fe25, world id 2097544.

2025-09-25T21:27:01.197Z In(182) vmkernel: cpu103:2099594)NvmeUtil: 470: Transient status for command 0x5 set to VMK_ABORTED because retries were inhibited (e.g., by a guest virtual device reset): cmdId.initiator=0x430c72ac9f00 cmdId.serialNumber=0x557fe25)
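
To see which devices are receiving resets and when, the "Virt reset issued on device" lines can be grouped by device identifier. This is a minimal Python sketch. The function name virt_resets_by_device is hypothetical and the pattern is assumed from the single NvmeDeviceIO sample line above.

```python
import re
from collections import defaultdict

# Assumed from the NvmeDeviceIO sample line above: the timestamp is the first
# field and the device identifier follows "Virt reset issued on device".
RESET_RE = re.compile(
    r"^(?P<ts>\S+Z) .*NvmeDeviceIO: \d+: Virt reset issued on device (?P<dev>\S+)"
)


def virt_resets_by_device(lines):
    """Map each device identifier to the timestamps of its virtual resets."""
    resets = defaultdict(list)
    for line in lines:
        m = RESET_RE.match(line)
        if m:
            resets[m.group("dev")].append(m.group("ts"))
    return dict(resets)
```

The reset timestamps can then be compared with the APD and heartbeat-timeout events in vobd.log for the same device.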

These errors occur even though the I/O load on the datastore is minimal.

Environment

VMware ESXi 8.x

Cause

The "status: 0x808" in the above messages indicates that the drives are busy and unable to respond to storage commands, so those commands ultimately time out.
As a result, guests issue SCSI resets to the device(s) from the guest layer in an attempt to clear any pending reservations and pending I/O and to return the device to an initial power-on state.
This is a third-party issue.

Resolution

Work with your hardware vendor to investigate why the drives are reporting a busy status.

Additional Information

For a list of common NVMe opcodes and status codes, see the KB article "NVMe OpCodes and Status Definitions".