VMs go into an unstable hung state

Article ID: 399479

Products

VMware vSphere ESXi

Issue/Introduction

  • Multiple VMs go into a hung state, with the Guest OS console reporting vmw_pvscsi ring full.
  • This condition is observed on large virtual machines configured with 256 vCPUs and up to 1TB of memory, particularly when IOMMU is enabled.
  • The VMs remain in a hung state and do not recover until a reboot is performed.

Entries similar to the following appear in /var/run/log/vmkernel.log:

YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu249:3437498)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x39e
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu249:3437498)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x340
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu253:3437498)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x39d
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu253:3437498)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x363
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu250:3437494)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x38f
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu250:3437494)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x37c

YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu97:3437599)VSCSI: 3439: handle 14763907134529565(GID:8221)(vscsi0:1):Reset request on FSS handle 58064454 (0 outstanding commands) from (vmm0:)
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu97:3437599)VSCSI: 3484: handle 14763907134529565(GID:8221)(vscsi0:1):Added handle (refCnt = 3) to vscsiResetHandleList vscsiResetHandleCount = 1
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu16:2098785)VSCSI: 3738: handle 14763907134529565(GID:8221)(vscsi0:1):processing reset for handle ... state 1381192707
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu16:2098785)VSCSI: 3531: handle 14763907134529565(GID:8221)(vscsi0:1):Completing reset (0 outstanding commands)
YYYY-MM-DDTHH:MM:SS In(182) vmkernel: cpu16:2098785)VSCSI: 3589: handle 14763907134529565(GID:8221)(vscsi0:1):reset processed removed handle from vscsiResetHandleList 0
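
To gauge how often the abort pattern above occurs, the log can be scanned with a short script. The following is a minimal sketch, not a supported tool: the function name count_pvscsi_aborts and the regular expression are illustrative assumptions based only on the line format shown in the excerpt, and real log lines may vary between ESXi builds.

```python
import re
import sys
from collections import Counter

# Assumed pattern, derived from the excerpt above, e.g.
# "...)PVSCSI: 2737: scsi0:0: SCSI ABORT ctx=0x39e"
ABORT_RE = re.compile(r"PVSCSI: \d+: (scsi\d+:\d+): SCSI ABORT ctx=(0x[0-9a-fA-F]+)")


def count_pvscsi_aborts(lines):
    """Return a Counter mapping each virtual SCSI device (e.g. 'scsi0:0')
    to the number of SCSI ABORT lines seen for it."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Default path is the log named in this article; pass another path to override.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/run/log/vmkernel.log"
    with open(path, errors="replace") as fh:
        for device, total in count_pvscsi_aborts(fh).most_common():
            print(f"{device}: {total} SCSI ABORT messages")
```

A burst of aborts against the same device in a short time window is consistent with the symptom described in this article, but it is not proof of this specific cause.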

Environment

  • VMware vSphere ESXi 8.x
  • VMware vSphere ESXi 7.x

Cause

Virtual machines may enter a hung state because of the way the IOMMU manages address translation when a 64-bit address space is used on AMD hardware.
In this scenario, the guest kernel passes the virtual IOMMU (vIOMMU) a memory address that exceeds the supported addressable range, which leads to the failure.

Resolution

Add "iommu=pt" to the Linux kernel boot options in the guest, as recommended by Red Hat in the article "VMware guest with large memory hangs".
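
After the guest has been rebooted with the new boot option, it can be useful to confirm that the running kernel actually received it. The following is a minimal sketch that checks the kernel command line exposed by Linux at /proc/cmdline. The helper name has_iommu_pt is a hypothetical choice for illustration, and the script assumes it is run inside the Linux guest.

```python
def has_iommu_pt(cmdline: str) -> bool:
    """Return True if 'iommu=pt' is present as a standalone kernel parameter.

    Splitting on whitespace avoids false matches on other parameters that
    merely contain the same text, such as 'amd_iommu=pt'.
    """
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    # /proc/cmdline holds the parameters the running kernel was booted with.
    with open("/proc/cmdline") as fh:
        cmdline = fh.read()
    if has_iommu_pt(cmdline):
        print("iommu=pt is active on the running kernel")
    else:
        print("iommu=pt is NOT set; update the boot loader configuration and reboot")
```

How the parameter is made persistent depends on the guest distribution and boot loader, so follow the Red Hat article for the exact procedure.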