ESXi host might fail with a purple diagnostic screen while getting SMART statistics from an NVMe drive

Article ID: 424457

Products

VMware vSphere ESXi 8.0

Issue/Introduction

If a synchronous command such as VMK_NVME_ADMIN_CMD_GET_LOG_PAGE is delayed for more than 120 seconds while fetching SMART statistics from an NVMe drive, whether because of a bad drive or for any other reason, the ESXi host might fail with a purple diagnostic screen.

Environment

ESXi 8.x

Cause

When certain NVMe synchronous commands are issued, the command may occasionally become unresponsive. To prevent the application from hanging, the system releases the wait and notifies the relevant module so processing can continue.

In this situation, the command may still complete later. However, the system incorrectly treats the timed-out command as finished and frees the associated memory resources.

When the NVMe device eventually completes the command, the system attempts to access memory that has already been released, which results in a purple diagnostic screen (PSOD).

Resolution

This issue is resolved by properly detecting the stuck I/O condition and ensuring that the associated resources are not released until the command has fully completed. The fix is included in ESXi 8.0 Update 3h.
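
The deferred-release idea described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the ESXi driver code: the Command class, abandon_wait, and complete are hypothetical names, and a Python bytearray stands in for the memory associated with the command. The point it shows is that the buffer is released only after both events have happened: the caller has stopped waiting and the device has completed the command.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Command:
    """Hypothetical model of a synchronous admin command and its buffer."""
    opcode: str
    buffer: Optional[bytearray] = field(default_factory=lambda: bytearray(16))
    wait_abandoned: bool = False
    completed: bool = False

    @property
    def released(self) -> bool:
        return self.buffer is None


def abandon_wait(cmd: Command) -> None:
    """Caller stops waiting after the timeout. The buffer stays allocated
    unless the device has already completed the command."""
    cmd.wait_abandoned = True
    if cmd.completed:
        cmd.buffer = None


def complete(cmd: Command, payload: bytes) -> None:
    """Device completion: write the result, then release the buffer only
    if the caller has already given up waiting for it."""
    if cmd.buffer is None:
        raise RuntimeError("completion touched a released buffer")
    cmd.buffer[:len(payload)] = payload
    cmd.completed = True
    if cmd.wait_abandoned:
        cmd.buffer = None


# Timeout first, late completion second (the ordering described in Cause).
cmd = Command("GET_LOG_PAGE")
abandon_wait(cmd)
still_allocated = not cmd.released   # buffer survives the timeout
complete(cmd, b"smart")              # late completion writes safely

# Normal ordering: completion arrives while the caller is still waiting.
cmd2 = Command("GET_LOG_PAGE")
complete(cmd2, b"ok")
```

In this model, releasing the buffer inside abandon_wait unconditionally would reproduce the faulty behavior: the later call to complete would then hit the released buffer and raise.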