ESXi 8.0.1 Host Experiences Purple Screen of Death (PSOD) with Page Fault Exception During VMFS Operations

Article ID: 381540

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

ESXi 8.0.1 hosts may experience a Purple Screen of Death (PSOD) with a Page Fault Exception during VMFS operations. This typically occurs during intensive I/O operations on VMFS-6 volumes.

Symptoms:
1. The host experiences a PSOD with "#PF Exception 14"
2. The backtrace shows involvement of VMFS modules
3. The error occurs in the context of resource management operations (in the res3HelperQu, FSUnmapManag, or fssAIO worlds)
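
The commands below are a minimal sketch for confirming that a coredump target is configured and for locating any dump written during the crash; exact output depends on the host's dump configuration.

# Check whether a coredump file or partition target is configured
esxcli system coredump file get
esxcli system coredump partition get

# List coredump files known to the host
esxcli system coredump file list

# Extracted vmkernel zdump files are typically placed under /var/core after reboot
ls /var/core/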


This issue manifests in several distinctive patterns that can help with identification:

Pattern 1 - Resource Helper Queue PSOD:

#PF Exception 14 in world 2102584:res3HelperQu IP 0x4200135590e2 addr 0x100000008

Key identifying stack trace elements:

#0  DLM_free (msp=..., mem=..., allowTrim=...)
#1  Heap_Free (heap=...)
#2  FS3_HeapMemFree (heapID=...)
#3  FS3_MemFree (realPtr=...)
#4  Res6NewFreeClusterEntry (rce=...)
#5  Res6NewFlushCache (resType=...)
#6  Res6FlushCacheInt (resType=...)
#7  Res6FlushCache (resType=...)
#8  Res3_FlushCachesVMFS6 (fsData=...)
#9  Res3FlushHelperVMFS6 (data=...)

Pattern 2 - FSUnmapManager PSOD:

#PF Exception 14 in world 2097977:FSUnmapManag IP 0x4200019591ac addr 0xe1800000010

Key identifying stack trace elements:

#0  DLM_free (msp=...)
#1  Heap_Free (heap=...)
#2  FS3_HeapMemFree (heapID=...)
#3  FS3_MemFree (realPtr=...)
#4  UnmapAddClustersToProcess (unmapsToProcess=...)
#5  UnmapProcessFromCluster (listHead=...)
#6  ProcessFS_Unmaps ()
#7  UnmapManager (unused=...)

Pattern 3 - File I/O Related PSOD:

#PF Exception 14 in world 3189220:fssAIO IP 0x420005158097 addr 0x10

Key identifying stack trace elements:

#0  tmalloc_large (nb=8512, m=...)
#1  DLM_malloc (msp=...)
#2  Heap_AlignWithTimeoutAndRA (heap=...)
#3  FS3_HeapMemAlloc (heapID=...)
#4  FS3_MemAlloc (size=...)
#5  Res6_InitCacheEntry (txn=...)
#6  Res6GetRC (txn=...)
#7  Res6_MemLockRC (resType=...)

Common Diagnostic Information:
1. vmkernel.log will typically show memory access errors:

Cannot access memory at address 0xe1800000010
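
As a quick check, the vmkernel log can be searched directly on the host for these messages; this is a minimal sketch using the default log location on ESXi.

# Search the live vmkernel log for the memory access error and the page fault signature
grep -i "cannot access memory" /var/run/log/vmkernel.log
grep -i "#PF Exception 14" /var/run/log/vmkernel.log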

2. Core dump analysis often reveals heap corruption with a specific chunk size:

FREE:  mchunkptr: 0x43174f3be5f0 (raw addr 0x43174f3be600); mchunk size=8512
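
Core dump analysis is normally performed from a full support bundle. The following is a minimal sketch of the default vm-support invocation, which collects the vmkernel logs and any coredumps so they can be reviewed or attached to a support request.

# Generate a support bundle containing logs and any collected coredumps
vm-support

# The bundle is typically written as a .tgz under /var/tmp
ls /var/tmp/esx-*.tgz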

3. Volume information can be gathered using:

vmkfstools -Ph /vmfs/volumes/<volume-name>

Example output showing affected volume:

VMFS-6.82 (Raw Major Version: 24) file system spanning 1 partitions.
Mode: public
Capacity X.X TB, Y.Y TB available, file block size 1 MB
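
To identify which datastores on a host are VMFS-6 and therefore potentially affected, the mounted filesystems can be listed and then checked individually. The loop below is a minimal sketch assuming the default /vmfs/volumes layout; each datastore may appear twice (once by UUID and once by label).

# List mounted filesystems and their VMFS versions
esxcli storage filesystem list

# Print detailed volume information for every entry under /vmfs/volumes
for vol in /vmfs/volumes/*; do
    vmkfstools -Ph "$vol" 2>/dev/null
done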

These patterns typically appear during:

  • Heavy I/O operations
  • Resource cleanup operations
  • File system unmap operations
  • After roughly 20 hours or more of sustained load
  • When multiple hosts are accessing the same VMFS volume
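
There is no single counter that identifies this condition, but sustained I/O load on the device backing a VMFS volume can be gauged by mapping the datastore to its extent device and sampling that device's statistics. This is a minimal sketch; the naa identifier is a placeholder for the device reported by the extent listing.

# Map each VMFS volume to its backing device
esxcli storage vmfs extent list

# Sample I/O statistics for the device backing the affected volume (placeholder device ID)
esxcli storage core device stats get -d naa.xxxxxxxxxxxxxxxx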

Environment

  • VMware ESXi 8.0.1 and earlier
  • VMware ESXi 7.0 Update 3n and earlier
  • VMFS-6 volumes
  • Can affect both iSCSI and NVMe storage configurations

Cause

A race condition in the VMFS resource management code can lead to premature freeing of memory resources while they are still in use. This occurs specifically during cluster resource entry (RCE) operations when multiple threads are accessing the same resources.

Resolution

  1. Upgrade to an ESXi build that includes the fix for this issue (the command sketch after this list can be used to confirm the build a host is currently running):
    • ESXi 7.0 Update 3o or later
    • ESXi 8.0 Update 2 or later
  2. If immediate upgrade is not possible, consider these temporary workarounds:
    • Reduce concurrent I/O intensity on affected VMFS volumes
    • Monitor for signs of high resource contention on VMFS volumes
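
To confirm whether a host is already running a fixed build, the installed version and build number can be checked directly on the host; this is a minimal sketch, and the reported build should be compared against the releases listed above.

# Report the installed ESXi version and build number
vmware -vl
esxcli system version get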

Additional Information

  • This issue can affect any VMFS-6 volume regardless of the underlying storage type
  • The problem may be more likely to occur under heavy I/O load conditions
  • Multiple hosts accessing the same VMFS volume may increase the likelihood of encountering this issue