NVMe soft failure or continuous reset can cause I/O to hang

Article ID: 314307


Updated On: 12-09-2024

Products

VMware vSAN

Issue/Introduction

This article describes an NVMe device soft failure or continuous reset condition that can cause I/O to hang on vSAN hosts, and provides the resolution.

Symptoms:

In vSAN 7.0 releases prior to 7.0 Update 2 (Build 17630552), extreme performance degradation may be seen when using NVMe devices. Errors like the following appear in vmkernel.log (your dates, times, and identifiers will differ), and the vsanmgmtd process on the host has crashed:

2021-03-02T03:46:22.025Z cpu43:1000876150)WARNING: NVMEPSA: NVMEPsaCommandDone:114: Complete vmkNvmeCmd: 0x45991e447900, vmkPsaCmd: 0x459920dc7240, cmdId.initiator=0x4538d071ab08, CmdSN: 0x0, status: 0x80d
2021-03-02T03:46:23.026Z cpu43:1000876150)WARNING: NVMEPSA: NVMEPsaCommandDone:114: Complete vmkNvmeCmd: 0x45991e41b380, vmkPsaCmd: 0x45992907c6c0, cmdId.initiator=0x4538d071ab08, CmdSN: 0x0, status: 0x80d
2021-02-05T12:20:41.623Z cpu26:2099832)NvmeScsi: 77: NVMe host error status: 0x6 translating to SCSI error H:0x8 repeated 409600 times on path vmhba17:C0:T0:L0 to namespace t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C
2021-02-05T12:20:46.054Z cpu31:2099832)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C" state in doubt; requested fast path state update...
2021-02-05T12:20:48.501Z cpu19:2099832)NvmeScsi: 77: NVMe host error status: 0x6 translating to SCSI error H:0x8 repeated 819200 times on path vmhba17:C0:T0:L0 to namespace t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C
2021-02-05T12:20:52.400Z cpu1:2097420)ScsiDeviceIO: 4267: Cmd(0x45cec0fa0040) 0x28, CmdSN 0x1f2d37 from world 0 to dev "t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C" failed H:0x3 D:0x0 P:0x0
2021-02-05T12:20:52.400Z cpu66:5184830 opID=1c059d14)Partition: 437: Failed read for "t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C": I/O error
2021-02-05T12:20:52.400Z cpu66:5184830 opID=1c059d14)Partition: 1123: Failed to read protective mbr on "t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C" : I/O error
2021-02-05T12:20:52.400Z cpu66:5184830 opID=1c059d14)WARNING: Partition: 1282: Partition table read from device t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01222702E28EE38C failed: I/O error
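These signatures can be checked for in bulk. The following Python sketch is a hypothetical helper, not part of any VMware tooling: it scans a copy of vmkernel.log (for example, one extracted from a support bundle) and counts the error patterns shown above. The patterns and command-line usage are assumptions based on the excerpts in this article.

#!/usr/bin/env python
# Hypothetical helper: count the NVMe error signatures from this article
# in a copy of vmkernel.log. Run as: python scan_vmkernel.py vmkernel.log
import re
import sys
from collections import Counter

# Signatures taken from the log excerpts above.
PATTERNS = {
    "psa_command_done": re.compile(r"NVMEPsaCommandDone.*status: 0x80d"),
    "host_error_0x6":   re.compile(r"NVMe host error status: 0x6"),
    "state_in_doubt":   re.compile(r"state in doubt"),
    "cmd_failed_h3":    re.compile(r"failed H:0x3"),
}

def scan(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1
    return counts

if __name__ == "__main__":
    for name, count in scan(sys.argv[1]).items():
        print("%-18s %d" % (name, count))

High counts of these messages, particularly the "repeated NNNNNN times" host errors, point toward the condition described in this article.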

Controller reset messages in vmkernel.log:

2021-04-14T03:48:06.981Z cpu111:2098304)WARNING: NVMEDEV:2383 Controller cannot be disabled, status: Timeout
2021-04-14T03:48:06.981Z cpu111:2098304)WARNING: NVMEDEV:5567 Failed to disable controller 256, status: Timeout
2021-04-14T03:48:06.981Z cpu111:2098304)NVMEDEV:6373 Request to start controller 256 recovery
2021-04-14T03:48:06.981Z cpu111:2098304)NVMEDEV:6391 Starting controller 256 recovery.
2021-04-14T03:48:06.981Z cpu127:2099329)NVMEDEV:6222 recover controller 256
2021-04-14T03:48:06.981Z cpu127:2099329)NVMEDEV:5537 Resetting controller 256 (nqn.2017-01.jp.co.toshiba:KCD5XLUG3T84:10E0A051TAHR)
2021-04-14T03:48:06.981Z cpu127:2099329)NVMEDEV:5232 controller 256
2021-04-14T03:48:06.981Z cpu127:2099329)NVMEDEV:5392 Reset admin queue (controller 256)
2021-04-14T03:48:06.981Z cpu127:2099329)NVMEDEV:5310 controller 256, queue 0
2021-04-14T03:48:06.985Z cpu127:2099329)NVMEDEV:5384 Controller 256, queue 0 reset complete. Status Success
2021-04-14T03:48:06.985Z cpu127:2099329)NVMEDEV:877 Ctlr 256, queue 0, update queue size to 256
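A single controller reset can be benign; the failure mode described in this article shows the same controller cycling through recovery repeatedly. As a rough check, this hypothetical Python sketch counts "Starting controller <N> recovery" events per controller in a copy of vmkernel.log (the message format is assumed from the excerpt above):

#!/usr/bin/env python
# Hypothetical sketch: count controller recovery events per controller.
# A handful of resets can be benign; a large count against the same
# controller suggests the continuous reset loop described here.
import re
import sys
from collections import Counter

RECOVERY = re.compile(r"NVMEDEV:\d+ Starting controller (\d+) recovery")

def count_recoveries(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = RECOVERY.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for controller, count in count_recoveries(sys.argv[1]).most_common():
        print("controller %s: %d recovery attempts" % (controller, count))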

In vobd.log, errors around vSAN device repair:

2021-01-29T21:37:47.290Z: [vSANCorrelator] 960058145540us: [vob.vsan.lsom.devicerepair] vSAN device 52d86aa1-a57f-1da7-10b1-e########d is being repaired due to I/O failures, and will be out of service until the repair is complete. If the device is part of a dedup disk group, the entire disk group will be out of service until the repair is complete.
2021-01-29T21:37:47.291Z: [vSANCorrelator] 960056269254us: [esx.problem.vob.vsan.lsom.devicerepair] Device 52d86aa1-a57f-1da7-10b1-e########d is in offline state and is getting repaired.
2021-01-29T21:37:47.300Z: [vSANCorrelator] 960058155716us: [vob.vsan.pdl.offline] vSAN device 52d86aa1-a57f-1da7-10b1-e########d has gone offline.
2021-01-29T21:37:47.300Z: [vSANCorrelator] 960056278969us: [esx.problem.vob.vsan.pdl.offline] vSAN device 52d86aa1-a57f-1da7-10b1-e#########d has gone offline.
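To correlate these events with disk-group membership, the affected vSAN device UUIDs can be extracted from vobd.log. The sketch below is a hypothetical example; the event tags and UUID format are taken from the excerpt above, and because the trailing characters are masked in this article, the pattern is deliberately lenient:

#!/usr/bin/env python
# Hypothetical sketch: list vSAN device UUIDs seen in device repair and
# PDL offline events in a copy of vobd.log.
import re
import sys

EVENTS = re.compile(
    r"\[(?:esx\.problem\.vob|vob)\.vsan\."
    r"(?:lsom\.devicerepair|pdl\.offline)\]"
    r".*?([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f#]{10,12})",
    re.IGNORECASE,
)

def offline_devices(path):
    seen = set()
    with open(path, errors="replace") as log:
        for line in log:
            match = EVENTS.search(line)
            if match:
                seen.add(match.group(1))
    return sorted(seen)

if __name__ == "__main__":
    for uuid in offline_devices(sys.argv[1]):
        print(uuid)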

In vsandevicemonitord.log:

2021-01-29 21:46:35,226 INFO vsandevicemonitord stderr None, stdout b"VsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C, errno (5)\nVsanUtil::GetVsanDisks: Error occurred 'Failed to open device /vmfs/devices/disks/t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C', create disk with null id\nVsanUtil::ReadFromDevice: Failed to open /vmfs/devices/disks/t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C, errno (5)\nErrors: \nUnable to unmount: Failed to open device /vmfs/devices/disks/t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C\n" from command /sbin/localcli vsan storage diskgroup unmount -d t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C.
2021-01-29 21:46:35,226 INFO vsandevicemonitord Unmounting failed on VSAN device t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C.

2021-01-29 21:46:35,301 INFO vsandevicemonitord Device t10.NVMe____Dell_Express_Flash_CD5_3.84T_SFF________01142702E28EE38C was already unmounted.
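Failed unmounts like the one above can be listed the same way. This hypothetical sketch collects the device identifiers from "Unmounting failed on VSAN device" messages in a copy of vsandevicemonitord.log:

#!/usr/bin/env python
# Hypothetical sketch: list devices whose disk-group unmount failed,
# based on the vsandevicemonitord message shown above.
import re
import sys

UNMOUNT_FAILED = re.compile(r"Unmounting failed on VSAN device (\S+)")

def failed_unmounts(path):
    devices = set()
    with open(path, errors="replace") as log:
        for line in log:
            match = UNMOUNT_FAILED.search(line)
            if match:
                # Strip the sentence-ending period from the device id.
                devices.add(match.group(1).rstrip("."))
    return sorted(devices)

if __name__ == "__main__":
    for device in failed_unmounts(sys.argv[1]):
        print(device)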



Environment

VMware vSAN 7.0.x

vSAN 7.0 builds prior to 7.0 Update 2 (Build 17630552)

Cause

This issue occurs when an NVMe device is soft-failing, or is stuck in a continuous reset state without the driver failing the device out of use. It is typically caused by a firmware issue that forces the device into an assert state but does not report back to the driver that the device has failed. At that point the device appears operational to the host, but it is not processing any I/O.

In this state the driver repeatedly attempts to reset the controller. Because the device still reports as available, the driver repeats the reset attempts indefinitely, and the device remains online and in use while processing no I/O.

Resolution

The issue is addressed in ESXi 7.0 Update 2 (Build 17630552). Upgrade to that build or later to resolve the issue.
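To confirm whether a host already carries the fix, compare its build number against 17630552. Below is a minimal sketch, assuming the output format of the standard `vmware -v` command (for example, "VMware ESXi 7.0.2 build-17630552"):

#!/usr/bin/env python
# Minimal sketch: check whether this ESXi host's build includes the fix.
import re
import subprocess

FIXED_BUILD = 17630552  # ESXi 7.0 Update 2

def host_build():
    # `vmware -v` prints e.g. "VMware ESXi 7.0.2 build-17630552"
    out = subprocess.check_output(["vmware", "-v"], text=True)
    match = re.search(r"build-(\d+)", out)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    build = host_build()
    if build is None:
        print("could not determine build number")
    elif build >= FIXED_BUILD:
        print("build %d: fix for this issue is included" % build)
    else:
        print("build %d: upgrade to 7.0 U2 (build 17630552) or later" % build)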

Workaround:

Rebooting the host or reseating a hot-swappable drive may temporarily alleviate this issue. Consult your hardware vendor to determine why the NVMe device is experiencing failures.

Additional Information

Impact/Risks:

This issue can cause severe performance degradation for virtual machines running on the affected vSAN datastore, to the point of the VMs hanging or crashing.