Virtual machines freeze intermittently or become unresponsive under heavy I/O load

Article ID: 327867

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • Running the ps -s | grep vm-name command on the ESXi host running the affected virtual machine returns output similar to:
4313969 vmm0:vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
4313972 4313957 vmx-vthread-5:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314204 4313957 vmx-vthread-6:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314205 4313957 vmx-vthread-7:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314206 4313957 vmx-vthread-8:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314210 4313957 vmx-mks:vm-name WAIT UPOL 0-63 /bin/vmx
4314212 4313957 vmx-svga:vm-name WAIT SEMA 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx


Note: In this output, vmm1 is blocked on a SCSI call (WAIT SCSI) and vmm0 is co-stopped (COSTOP).
  • You see the error:
Unable to connect to the MKS: Error connecting to /bin/vmx process.
  • Virtual machines are unreachable over the network.
  • Virtual machines may report an invalid state.
  • Virtual machines are unresponsive.
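
The thread states in the ps -s output above can be summarized with a small filter. This is a minimal sketch, not a supported tool: the function name summarize_vm_threads is hypothetical, it assumes awk is available in the ESXi shell, and it assumes the state and wait-reason columns appear as in the sample output, which may differ between ESXi builds.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -s | grep vm-name` output on stdin and counts
# threads that are waiting on SCSI or are co-stopped.
summarize_vm_threads() {
    awk '
        {
            # Scan every field, because vmm and vmx lines in the sample
            # output do not have the same number of leading columns.
            for (i = 1; i <= NF; i++) {
                if ($i == "WAIT" && $(i + 1) == "SCSI") scsi++
                if ($i == "COSTOP") costop++
            }
        }
        END { printf "scsi_wait=%d costop=%d\n", scsi, costop }
    '
}
```

On the host, it could be used as ps -s | grep vm-name | summarize_vm_threads. For the sample output above, it reports two threads waiting on SCSI (vmm1 and vmx-vcpu-1) and two co-stopped threads (vmm0 and vmx-vcpu-0).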

Environment

VMware vSphere ESXi 7.x
VMware vSphere ESXi 8.x

Cause

A virtual machine can become unresponsive when:
  • Quiesced snapshots are taken, or a custom quiescing script is used.
  • The ESXi host is under heavy I/O load.
  • There are storage performance issues at the device, storage pool, or LUN level.
  • One of the Virtual Machine Monitor (VMM) threads is blocked on a VSCSI call, and the other VMM threads are co-stopped while waiting for the blocked thread to make progress.

Resolution

Caution: Ensure that no snapshot consolidation tasks or backups are running on the virtual machines before you proceed.

To recover the virtual machine from its locked-up state:
  1. Run this command to list the processes for the virtual machine:
ps -s | grep vm-name

Note: Refer to the ps -s output shown in the Symptoms section of this Knowledge Base article.

  2. Find the vmx-vcpu thread that is waiting on a SCSI event and note its cartel ID.

Note: The number in the second column of the output is the cartel ID.

  3. Run kill -18 cartel-ID to send the SIGCONT signal to the cartel, which resumes the stopped process.
  4. After completing the steps above, the virtual machine may need to be reloaded. For more information, see Reloading a vmx file without removing the virtual machine from inventory (broadcom.com).
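
The lookup of the cartel ID in the steps above can be sketched as a dry run that only prints the kill command instead of executing it. This is an illustrative sketch under stated assumptions: the function names find_blocked_cartel and print_sigcont_command are hypothetical, awk is assumed to be available in the ESXi shell, and the column positions are taken from the sample vmx-vcpu lines in this article.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -s | grep vm-name` output on stdin and prints
# the cartel ID (second column) of the first vmx-vcpu thread in WAIT SCSI.
find_blocked_cartel() {
    awk '$3 ~ /^vmx-vcpu-/ && $4 == "WAIT" && $5 == "SCSI" { print $2; exit }'
}

# Dry run: print the kill command from the steps above rather than running it,
# so the cartel ID can be reviewed first.
print_sigcont_command() {
    cartel_id=$(find_blocked_cartel)
    if [ -n "$cartel_id" ]; then
        echo "kill -18 $cartel_id"
    else
        echo "no vmx-vcpu thread waiting on SCSI found" >&2
        return 1
    fi
}
```

On the host, it could be used as ps -s | grep vm-name | print_sigcont_command. Printing the command rather than executing it keeps the decision to send the signal with the administrator, which matters given the caution about running backups and snapshot consolidation.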

Notes:
  • The steps above are a workaround to recover the virtual machine from the locked-up state.
  • For more information on SIGCONT, see Sending signal to Processes.


