Oracle RAC nodes reboot and ESXi host PSOD due to very high disk latency in vSAN clusters
Article ID: 434020

Updated On:

Products

VMware vSAN VMware vSphere ESXi

Issue/Introduction

  • In a VMware vSAN environment, multiple Oracle Cluster nodes may experience unexpected reboots, accompanied by high disk latency across various VMs. In some cases, an ESXi host may encounter a Purple Screen of Death (PSOD).
  • Oracle RAC nodes reboot approximately 30–60 seconds after a storage latency event.
  • Oracle Cluster logs (ohasd_cssdmonitor_root.trc) show network and disk communication timeouts:

YYYY-MM-DDT HH:MM:SS [OCSSD(######)]CRS-1611: Network communication with node <Node Name> has been missing for 75% of the timeout interval.
YYYY-MM-DDT HH:MM:SS [OCSSD(######)]CRS-1613: No I/O has completed after 90% of the maximum interval. If this persists, voting file <Device Path> will be considered not functional.

  • /var/log/vmkernel.log shows vSCSI resets for the affected VM:

2026-01-29T22:40:57.411Z In(182) vmkernel: cpu35:2098019)VSCSI: 3738: handle 200294949453832814(GID:8814)(vscsi10:0):processing reset for handle ... state 1381192707
2026-01-29T22:40:57.411Z In(182) vmkernel: cpu35:2098019)VSCSI: 3845: handle 200294949453832814(GID:8814)(vscsi10:0):Reset [Retries: 0/0] from (vmm0:#######p00)
2026-01-29T22:41:25.459Z In(182) vmkernel: cpu35:2098019)VSCSI: 3531: handle 200294949453832814(GID:8814)(vscsi10:0):Completing reset (0 outstanding commands)
2026-01-29T22:41:25.459Z In(182) vmkernel: cpu35:2098019)VSCSI: 3589: handle 200294949453832814(GID:8814)(vscsi10:0):reset processed removed handle from vscsiResetHandleList 0

  • The VM's vmx log reports a VMware Tools heartbeat timeout, guest command failures, and a subsequent halt:

2026-01-29T22:41:36.469Z In(05) vmx - GuestRpc: app toolbox's second ping timeout; assuming app is down
2026-01-29T22:41:36.469Z In(05) vmx - Tools: [AppStatus] Last heartbeat value 18674844 (last received 28s ago)
2026-01-29T22:41:36.469Z In(05) vmx - TOOLS: appName=toolbox, oldStatus=2, status=0, guestInitiated=0.
2026-01-29T22:41:36.472Z In(05) vmx - GuestRpc: Reinitializing Channel 0(toolbox)
2026-01-29T22:41:36.472Z In(05) vmx - GuestMsg: Channel 0, Cannot unpost because the previous post is already completed
2026-01-29T22:41:36.472Z In(05) vmx - Tools: [AppStatus] Last heartbeat value 18674844 (last received 28s ago)
2026-01-29T22:41:36.472Z In(05) vmx - TOOLS: appName=toolbox, oldStatus=0, status=0, guestInitiated=0.
2026-01-29T22:41:43.775Z In(05) vcpu-80 - NVME-VMK: nvme0:9: WRITE Command failed. Status: 0x0/0x7.
2026-01-29T22:41:43.775Z In(05) vcpu-80 - NVME-VMK: nvme0:9: WRITE Command failed. Status: 0x0/0x7.
2026-01-29T22:41:43.779Z In(05) vcpu-0 - Vix: [vmxCommands.c:7175]: VMAutomation_HandleCLIHLTEvent. Do nothing.
2026-01-29T22:41:43.779Z In(05) vcpu-0 - MsgHint: msg.monitorevent.halt
2026-01-29T22:41:43.779Z In(05)+ vcpu-0 - The CPU has been disabled by the guest operating system. Power off or reset the virtual machine.

  • The ESXi host may experience a PSOD with a backtrace similar to:

2026-01-29T22:43:44.405Z cpu0:193429680)Backtrace for current CPU #0,worldID=193429680,fp=0x430bb4bae160
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9be10:[0x42001deb7986]PsaNvmeDeviceTaskMgmt@vmkernel#nover+0xd2 stack: 0xffffff, 0x0, 0x41ffdde507e0, 0x434d278013c0, 0x453b69d9be40
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9bea0:[0x42001deb827a]PsaNvmeDeviceTimeoutHandlerFn@vmkernel#nover+0x17f stack: 0x5d00000000, 0xbd3c6a822c0dfa, 0x41ffddec9f40, 0x1, 0x4200400016c0
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9bf60:[0x42001df1d029]PsaStorDeviceTimeoutHandlerFn@vmkernel#nover+0x62 stack: 0x0, 0x420000000cd7, 0x430bb4bac100, 0x10, 0x233
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9bfa0:[0x42001dfc2a9f]PsaStorTaskMgmtWorldFunc@vmkernel#nover+0x8c stack: 0x453afba1f100, 0x453b69d9f100, 0x0, 0x0, 0x0
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9bfe0:[0x42001e0d67b2]CpuSched_StartWorld@vmkernel#nover+0xbf stack: 0x0, 0x42001db44cf0, 0x0, 0x0, 0x0
2026-01-29T22:43:44.405Z cpu0:193429680)0x453b69d9c000:[0x42001db44cef]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0, 0x0, 0x0, 0x0, 0x0
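To confirm this symptom pattern from a host log bundle, the vSCSI reset events shown above can be counted with a small grep wrapper. This is a sketch: the helper name is hypothetical, the default path /var/log/vmkernel.log is an assumption (pass the path to an extracted log bundle when working offline), and the exact message text can vary between ESXi builds.

```shell
# Hypothetical helper: count vSCSI reset events in a vmkernel.log file.
# Defaults to /var/log/vmkernel.log when no path is given.
vscsi_reset_count() {
  grep -c 'VSCSI:.*processing reset' "${1:-/var/log/vmkernel.log}"
}
```

A burst of resets against the same vscsi handle shortly before a guest reboot, as in the excerpt above, points at the storage latency event rather than an in-guest problem.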

Environment

  • VMware vSAN
  • Application clustering enabled on the VMs (Oracle RAC, Veeam HA Cluster, etc.)

Cause

  • The issue is triggered by extreme latency (ranging from 7 to 120+ seconds) on a specific storage device. When a drive experiences internal delays or failure, it may stop responding to I/O requests, aborts, and resets issued by the driver.
  • ESXi vmkernel.log reports CACHE_SLOW warnings from the ZDOM layer:

YYYY-MM-DDT HH:MM:SS Wa(180) vmkwarning: cpu##:########)WARNING: ZDOMBLKCACHE: CACHE_SLOW: X Block {UUID} blocked ##.# sec.

  • VM impact: VMs with data components on the slow drive experience I/O timeouts. Oracle RAC nodes, which are highly sensitive to "Misscount" and "Disktimeout" intervals, trigger a node eviction/reboot when heartbeats to voting disks or partner nodes fail.
  • Host PSOD: A rare race condition in the PSA (Pluggable Storage Architecture) occurs when a command takes longer than 120 seconds. If the command completes after the host has already issued a timeout/force-complete, it can trigger a PSOD in DeviceTaskMgmt.

Please note that the PSOD is a result of the extreme latency experienced on the disk, not the cause of this issue.
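The timing in the CRS messages above follows from the Oracle Clusterware CSS defaults (misscount = 30 s for the network heartbeat, disktimeout = 200 s for voting-disk I/O; the effective values on a RAC node can be read with `crsctl get css misscount` and `crsctl get css disktimeout`). A minimal sketch of the warning thresholds, assuming those defaults (the helper name is hypothetical):

```shell
# Sketch of the CSS timing math: CRS-1611 fires at 75% of misscount and
# CRS-1613 at 90% of disktimeout, so a storage stall in the 7-120+ second
# range observed here is enough to trip the voting-disk warning path.
css_warn_thresholds() {
  local misscount=${1:-30} disktimeout=${2:-200}
  echo "CRS-1611 warning after: $(( misscount * 75 / 100 ))s"
  echo "CRS-1613 warning after: $(( disktimeout * 90 / 100 ))s"
}
```

With the defaults, the network-heartbeat warning appears after roughly 22 seconds, which matches the observed node reboots approximately 30-60 seconds after the latency event.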


Resolution