Virtual Machine unresponsiveness during vMotion memory transfer on VMs with a large amount of memory allocated

Article ID: 427086

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • A VM experiences periods of unresponsiveness during vMotion memory transfer.
  • During vMotion memory transfer, a VM might experience periods of unresponsiveness while the guest is issuing large I/O requests. This might manifest as aborted guest I/O commands, application unavailability, and degraded performance.
  • /var/run/log/vmkernel.log on the ESXi host shows entries like the following, reporting a delay in processing the preCopyNext action:

    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu133:45716049)PVSCSI: 2769: scsi1:2: SCSI ABORT ctx=0x363
    YYYY-MM-DDTHH:MM:SS.Z Wa(180) vmkwarning: cpu40:45715994)WARNING: VMotion: 1451: 8686083635737680100 S: Waited 30.315 seconds for the monitor to process a preCopyNext action.  This may cause unexpected vMotion failures.

  • In vmware.log for the virtual machine, located in /vmfs/volumes/<datastore>/<vmfolder>, SCSI abort messages may also be seen:

    YYYY-MM-DDTHH:MM:SS.Z In(05) vcpu-77 - PVSCSI: scsi3:2: aborting cmd 0x2dc
    YYYY-MM-DDTHH:MM:SS.Z In(05) vcpu-92 - PVSCSI: scsi2:2: aborting cmd 0x37a
    YYYY-MM-DDTHH:MM:SS.Z In(05) vcpu-53 - PVSCSI: scsi0:3: aborting cmd 0x2c3
    YYYY-MM-DDTHH:MM:SS.Z In(05) vcpu-54 - PVSCSI: scsi1:2: aborting cmd 0x363

  • In /var/run/log/vmkernel.log on the ESXi host, reset messages may also be seen:

    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu188:45716086)VSCSI: 3473: handle 196348699146723760(GID:8624)(vscsi0:3):Reset request on FSS handle 1892843346 (0 outstanding commands) from (vmm0:<VM-Name>)
    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu188:45716086)VSCSI: 3518: handle 196348699146723760(GID:8624)(vscsi0:3):Added handle (refCnt = 3) to vscsiResetHandleList vscsiResetHandleCount = 1
    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu4:2098403)VSCSI: 3772: handle 196348699146723760(GID:8624)(vscsi0:3):processing reset for handle ... state 1381192707
    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu4:2098403)VSCSI: 3565: handle 196348699146723760(GID:8624)(vscsi0:3):Completing reset (0 outstanding commands)
    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu24:45721478)VSCSI: 8589: handle 196348699146723760(GID:8624)(vscsi0:3):Destroying Device for world 45715994 (pendCom 0)
    YYYY-MM-DDTHH:MM:SS.Z In(182) vmkernel: cpu24:45721478)VSCSI: 8589: handle 196348699184472505(GID:8633)(vscsi3:2):Destroying Device for world 45715994 (pendCom 0)
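
To check a copy of the logs for the signatures above, a small script can help. The following is a minimal sketch, not a supported tool: the patterns are derived only from the sample lines in this article, and the function name scan_lines is hypothetical. Real log lines may differ in detail between builds, so treat a miss as inconclusive.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the sample log lines in this article.
PRECOPY_WAIT = re.compile(
    r"VMotion: .*?Waited (\d+(?:\.\d+)?) seconds for the monitor "
    r"to process a preCopyNext action"
)
# Matches both the vmkernel.log form ("PVSCSI: 2769: scsi1:2: SCSI ABORT")
# and the vmware.log form ("PVSCSI: scsi3:2: aborting cmd").
SCSI_ABORT = re.compile(
    r"PVSCSI: (?:\d+: )?(scsi\d+:\d+): (?:SCSI ABORT|aborting cmd)"
)


def scan_lines(lines):
    """Return (preCopyNext wait times in seconds, abort count per virtual disk)."""
    waits = []
    aborts = Counter()
    for line in lines:
        match = PRECOPY_WAIT.search(line)
        if match:
            waits.append(float(match.group(1)))
            continue
        match = SCSI_ABORT.search(line)
        if match:
            aborts[match.group(1)] += 1
    return waits, aborts
```

For example, calling scan_lines on an open file object for a copy of vmkernel.log returns the list of preCopyNext wait times and a per-device count of aborts. A non-empty list of waits together with aborts in the same time window is consistent with the issue described here.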

Environment

  • ESXi 7.x
  • ESXi 8.x
  • ESX 9.0.x

Cause

High I/O activity during vMotion can overflow the buffer used for precise memory tracking. To maintain data integrity, the system falls back to a coarse-grained tracking mode in which a single change marks an entire large block of memory as modified, rather than just the specific page that changed. This "amplification" drastically increases the CPU overhead required to trace memory changes, which can overwhelm the system and cause guest I/O to stall or time out.
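
The amplification effect can be illustrated with a toy model. This is only a sketch: the 4 KiB page size, 2 MiB block size, buffer capacities, and the function name track_dirty are all hypothetical values chosen for illustration, and do not describe the actual ESXi tracking implementation.

```python
# All sizes are hypothetical, chosen only to illustrate the effect.
PAGE_SIZE = 4 * 1024                 # assumed fine-grained tracking unit
BLOCK_SIZE = 2 * 1024 * 1024         # assumed coarse-grained tracking unit
PAGES_PER_BLOCK = BLOCK_SIZE // PAGE_SIZE


def track_dirty(written_addresses, buffer_capacity):
    """Return (mode, number of pages marked as modified).

    While the distinct dirty pages fit in the tracking buffer, each write
    marks one page. On overflow, the model falls back to marking every
    page of each block that was written to.
    """
    pages = {addr // PAGE_SIZE for addr in written_addresses}
    if len(pages) <= buffer_capacity:
        return "fine", len(pages)
    blocks = {addr // BLOCK_SIZE for addr in written_addresses}
    return "coarse", len(blocks) * PAGES_PER_BLOCK


# One write into each of 100 different blocks.
writes = [i * BLOCK_SIZE for i in range(100)]
print(track_dirty(writes, buffer_capacity=1000))  # buffer is large enough
print(track_dirty(writes, buffer_capacity=64))    # buffer overflows
```

In this model the same 100 writes mark 100 pages in the fine-grained case and 51,200 pages after the fallback, which is the kind of growth in tracked memory that the paragraph above refers to.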

Resolution

The upcoming VCF 9.1 release changes how memory tracking is implemented during vMotion, so the issue will be fully resolved in that version.