ESXi hosts encounter numerous command aborts and timeouts on Pure Storage Arrays using NVMe-oF

Article ID: 388546

Products

VMware vSphere ESXi

VMware vSphere ESXi 8.0

VMware vSphere ESXi 7.0

VMware vSphere ESXi 6.0

Issue/Introduction

Pure Storage All-Flash arrays allow customers to connect to storage using NVMe-oF, instead of FC/TCP, to increase performance under high I/O loads.

The issue manifests as a storage outage. Storage may disconnect or go offline, or come to a complete standstill, as commands sent to the array start to abort and time out.

To confirm a match for this issue, search the vmkernel log for timeouts with opcode 0x7f, as in the example below. Many such entries should appear when the issue is hit:

vmkernel.1:YYYY-MM-DDTHH:MM:SSZ In(182) vmkernel: cpu116:2099160)nvmetcp:nt_HandleAdminCmdTimeout:6009 [ctlr 267, queue 3] txPdu 0x431e1f9d75c0, vmkCmd 0x45dd401e8a40(13), opcode 0x7f, timed out.
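
Because a single entry proves little and the issue produces many, it can help to count matching entries per controller and queue. The following Python is a minimal sketch of that, not a supported tool. The pattern is modeled only on the sample line above, so other builds may format the message differently, and the function name count_opcode_timeouts is hypothetical.

```python
import re
from collections import Counter

# Assumption: the pattern mirrors the one sample line in this article.
# Adjust it if your vmkernel log formats the message differently.
TIMEOUT_RE = re.compile(
    r"nvmetcp:nt_HandleAdminCmdTimeout:\d+ "
    r"\[ctlr (?P<ctlr>\d+), queue (?P<queue>\d+)\].*?"
    r"opcode (?P<opcode>0x[0-9a-fA-F]+), timed out"
)


def count_opcode_timeouts(lines, opcode="0x7f"):
    """Count timed-out commands with the given opcode, keyed by (ctlr, queue)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match and match.group("opcode").lower() == opcode.lower():
            key = (int(match.group("ctlr")), int(match.group("queue")))
            counts[key] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for a copy of the vmkernel log can be passed in directly. A high count on one or more controllers is consistent with the behavior described here, but it does not confirm the cause on its own.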

Environment

VMware vSphere ESXi 8.x

VMware vSphere ESXi 7.x

Purity Release 6.7.1 or earlier

Cause

Pure has identified a known issue in which buffers on the storage devices are not cleared after commands complete. The buffers fill up, so further commands never complete, which produces the abort and timeout "storm" seen in ESXi environments using these NVMe-oF devices. Restarting ESXi services appears to clear the issue temporarily, because the buffers are cleared when the driver logs out of the array and logs back in.

Resolution

The full resolution is to upgrade to the latest Purity release.

The issue is fully addressed in the Purity "long life release" (LLR) version 6.7.2.
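
To check a list of arrays against the fix release, the version comparison can be sketched as below. This assumes that Purity versions are plain dotted numbers such as 6.7.1 and that any version numerically at or above 6.7.2 contains the fix. Both are assumptions, so confirm with Pure Storage for releases outside the 6.7 line. The names parse_version and is_affected are hypothetical, and how the version is read from the array is not covered.

```python
def parse_version(text):
    """Turn a dotted version such as '6.7.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Fix release named in this article.
FIXED_IN = parse_version("6.7.2")


def is_affected(purity_version):
    """Return True if the version is older than 6.7.2 (6.7.1 or earlier)."""
    # Assumption: numeric ordering of the dotted parts reflects fix status.
    return parse_version(purity_version) < FIXED_IN
```

Comparing tuples of integers rather than raw strings keeps 6.7.10 ordered after 6.7.2, which a plain string comparison would get wrong.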