ESXi Host PSOD triggered on Excessive UNMAP Failures & Slow Storage Response

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Messages reporting deteriorated I/O latency appear in the vmkernel logs:

YYYY-MM-DD:HH:MM:SS.178Z cpu68:2098702)WARNING: ScsiDeviceIO: 1780: Device naa.############################# performance has deteriorated. I/O latency increased from average value of 1358 microseconds to 340625 microseconds.
YYYY-MM-DD:HH:MM:SS.445Z cpu109:2098707)WARNING: ScsiDeviceIO: 1780: Device naa.############################# performance has deteriorated. I/O latency increased from average value of 1367 microseconds to 98813 microseconds.
YYYY-MM-DD:HH:MM:SS.552Z cpu4:2098708)WARNING: ScsiDeviceIO: 1780: Device naa.############################# performance has deteriorated. I/O latency increased from average value of 1367 microseconds to 51917 microseconds.
YYYY-MM-DD:HH:MM:SS.547Z cpu71:2098702)WARNING: ScsiDeviceIO: 1780: Device naa.############################# performance has deteriorated. I/O latency increased from average value of 1363 microseconds to 569679 microseconds.
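
When many of these warnings are present, it can help to extract the device and latency figures rather than read them by eye. The following is a minimal Python sketch, not a VMware tool. It assumes the message wording matches the excerpts above; the function name parse_latency_warning and the device name naa.EXAMPLE are hypothetical.

```python
import re

# Matches the "performance has deteriorated" warning shown in the excerpts above.
# Assumption: the message wording is the same on the host being analyzed.
LATENCY_RE = re.compile(
    r"Device (?P<device>\S+) performance has deteriorated\. "
    r"I/O latency increased from average value of (?P<avg>\d+) microseconds "
    r"to (?P<cur>\d+) microseconds"
)


def parse_latency_warning(line):
    """Return (device, average_us, current_us, ratio), or None if the line does not match."""
    m = LATENCY_RE.search(line)
    if m is None:
        return None
    avg = int(m.group("avg"))
    cur = int(m.group("cur"))
    # Ratio of current to average latency; guard against a zero average.
    ratio = cur / avg if avg else float("inf")
    return m.group("device"), avg, cur, ratio


if __name__ == "__main__":
    # Hypothetical sample line; naa.EXAMPLE is a placeholder device name.
    sample = (
        "cpu68:2098702)WARNING: ScsiDeviceIO: 1780: Device naa.EXAMPLE "
        "performance has deteriorated. I/O latency increased from average value "
        "of 1358 microseconds to 340625 microseconds."
    )
    print(parse_latency_warning(sample))
```

The ratio of current to average latency gives a quick indication of how severe each spike is relative to the device's baseline.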

Alongside the deterioration messages, the vmkernel logs show a large number of UNMAP (cmd 0x42) failures, with queued commands cancelled with host status HOST_ABORT (H:0x5) or HOST_RESET (H:0x8):

YYYY-MM-DD:HH:MM:SS.505Z In(182) vmkernel: cpu58:2097306)ScsiDeviceIO: 4670: Cmd(0x45db44f71d40) 0x42, cmdId.initiator=0x430bb5912d40 CmdSN 0x829d301 from world 2786901 to dev "naa.#############################" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer
YYYY-MM-DD:HH:MM:SS.505Z In(182) vmkernel: cpu71:2097366)ScsiDeviceIO: 4670: Cmd(0x45db44ebef40) 0x42, cmdId.initiator=0x430bb5912d40 CmdSN 0x829d300 from world 2786901 to dev "naa.#############################" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer

YYYY-MM-DD:HH:MM:SS.225Z In(182) vmkernel: cpu95:2097365)ScsiDeviceIO: 4605: Cmd(0x45bb5dfb9d80) 0x42, cmdId.initiator=0x430bb5912d40 CmdSN 0x829d3fc from world 2841660 to dev "naa.#############################" failed H:0x8 D:0x0 P:0x0 Cancelled from device layer
YYYY-MM-DD:HH:MM:SS.322Z In(182) vmkernel: cpu95:2097365)ScsiDeviceIO: 4605: Cmd(0x45db374084c0) 0x42, cmdId.initiator=0x430bb5912d40 CmdSN 0x829d4b9 from world 2790618 to dev "naa.#############################" failed H:0x8 D:0x0 P:0x0 Cancelled from device layer
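
To gauge how many UNMAP commands failed and with which host status, the failure lines can be tallied per device. The following is a minimal Python sketch that assumes the line layout shown in the excerpts above. The function name count_unmap_failures and the device name naa.EXAMPLE are hypothetical, and only the two host status codes named in this article are translated to names.

```python
import re
from collections import Counter

# Host status codes named in this article; any other code is reported as hex.
HOST_STATUS = {0x5: "HOST_ABORT", 0x8: "HOST_RESET"}

# Assumption: failure lines follow the layout of the excerpts above.
FAIL_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)',
    re.IGNORECASE,
)

UNMAP_OPCODE = 0x42


def count_unmap_failures(lines):
    """Count failed UNMAP commands, keyed by (device, host status)."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m is None or int(m.group("op"), 16) != UNMAP_OPCODE:
            continue  # not a failure line, or not an UNMAP command
        host = int(m.group("h"), 16)
        counts[(m.group("dev"), HOST_STATUS.get(host, hex(host)))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample line; naa.EXAMPLE is a placeholder device name.
    sample = [
        'ScsiDeviceIO: 4670: Cmd(0x45db44f71d40) 0x42, '
        'cmdId.initiator=0x430bb5912d40 CmdSN 0x829d301 from world 2786901 '
        'to dev "naa.EXAMPLE" failed H:0x5 D:0x0 P:0x0 Cancelled from device layer',
    ]
    for key, n in count_unmap_failures(sample).items():
        print(key, n)
```

A count dominated by a single device points to that device or its backing array as the place to start investigating.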

PSOD backtrace from the vmkernel logs:

YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01be70:[0x420017f6e587]MCSLockWait@vmkernel#nover+0x10f stack: 0x45ba59ba9540, 0x420017f6eb6e, 0x45ba59ba9540, 0x4200182e07ef, 0x100453a0e01bf00
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01be90:[0x420017f6eb6d]MCSLockWork@vmkernel#nover+0x2a stack: 0x100453a0e01bf00, 0x420000000000, 0x26700000000, 0x188eb43c9d3f56, 0x430a754d5be8
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01bea0:[0x4200182e07ee]PsaScsiDeviceTimeoutHandlerFn@vmkernel#nover+0x56f stack: 0x26700000000, 0x188eb43c9d3f56, 0x430a754d5be8, 0x40, 0x42004fc016c0
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01bf60:[0x42001831fcc8]PsaStorDeviceTimeoutHandlerFn@vmkernel#nover+0x59 stack: 0x0, 0x420000000cd7, 0x430a754d5b40, 0x10, 0x209a1
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01bfa0:[0x4200183c5fff]PsaStorTaskMgmtWorldFunc@vmkernel#nover+0x8c stack: 0x453a10a9f100, 0x453a0e01f100, 0x0, 0x0, 0x0
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01bfe0:[0x4200184dc88e]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x420017f44fb0, 0x0, 0x0, 0x0
YYYY-MM-DD:HH:MM:SS.971Z cpu63:6728669)0x453a0e01c000:[0x420017f44faf]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0

The backtrace on the PSOD screen is similar to:

#PF Exception 14
MCSLockWait@vmkernel
MCSLockWork@vmkernel
PsaScsiDeviceTimeoutHandlerFn@vmkernel
PsaStorDeviceTimeoutHandlerFn@vmkernel
PsaStorTaskMgmtWorldFunc@vmkernel
CpuSched_StartWorld@vmkernel
Debug_IsInitialized@vmkernel
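
To compare a host's backtrace against this signature, the function names can be extracted from the raw frames and checked in order. The following is a minimal Python sketch that assumes frames use the "[address]Symbol@vmkernel" layout shown above. The names SIGNATURE, extract_frames and matches_signature are hypothetical, and the signature deliberately omits the two generic frames at the bottom of the stack.

```python
import re

# Distinctive frames from the backtrace in this article, top of stack first.
SIGNATURE = [
    "MCSLockWait",
    "MCSLockWork",
    "PsaScsiDeviceTimeoutHandlerFn",
    "PsaStorDeviceTimeoutHandlerFn",
    "PsaStorTaskMgmtWorldFunc",
]

# Assumption: each frame looks like "[0x...]Symbol@vmkernel#nover+0x...".
FRAME_RE = re.compile(r"\](\w+)@vmkernel")


def extract_frames(lines):
    """Return the function names found in raw backtrace lines, in order."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append(m.group(1))
    return frames


def matches_signature(frames, signature=SIGNATURE):
    """True if every signature symbol appears in frames, in the same order."""
    remaining = iter(frames)
    # "sym in remaining" advances the iterator, so order is enforced.
    return all(sym in remaining for sym in signature)


if __name__ == "__main__":
    raw = [
        "0x453a0e01be70:[0x420017f6e587]MCSLockWait@vmkernel#nover+0x10f stack: 0x0",
        "0x453a0e01be90:[0x420017f6eb6d]MCSLockWork@vmkernel#nover+0x2a stack: 0x0",
    ]
    print(extract_frames(raw))
```

A match on function names alone is only an indication; offsets and build numbers can differ between hosts with the same underlying problem.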

Environment

VMware vSphere ESXi 8.0

Cause

This issue can occur when the vSphere environment generates UNMAP commands (0x42) at a higher rate than the storage array can process. The result is performance deterioration, a backlog of pending UNMAP I/O, and slow processing of aborts for the delayed I/O on the device.

Resolution

Investigate why the UNMAP (0x42) command rate is elevated. Focus on possible storage array overload and on contributing factors such as VM UNMAP granularity, firmware, design or driver issues, and device-level anomalies.

Workaround: As a temporary measure, disable space reclamation or lower its rate. Before making this change, refer to the following VMware Knowledge Base article for instructions and important considerations:

How to throttle the unmap requests on Datastore ( Space Reclamation )