Virtual machines freeze intermittently or become unresponsive under heavy I/O load

Article ID: 327867

Products

VMware vSphere ESXi

Issue/Introduction

Running the following command on the ESXi host where the affected virtual machine resides returns output similar to:

$ ps -s | grep <vm-name>

4313969 vmm0:vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
4313972 4313957 vmx-vthread-5:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314204 4313957 vmx-vthread-6:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314205 4313957 vmx-vthread-7:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314206 4313957 vmx-vthread-8:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314210 4313957 vmx-mks:vm-name WAIT UPOL 0-63 /bin/vmx
4314212 4313957 vmx-svga:vm-name WAIT SEMA 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx

Note: In this output, vmm1 is blocked on a SCSI call (WAIT SCSI), and vmm0 is co-stopped (COSTOP) while it waits for vmm1 to make progress.
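The output above can be long on a busy host, so it may help to filter it down to the threads that matter. The following is a minimal sketch, not a supported tool: it assumes the ps -s layout shown in this article, and the function names sample_ps_output and find_blocked_threads are hypothetical. It reads the sample lines quoted above rather than live output.

```shell
#!/bin/sh
# Sample input: the ps -s lines quoted in this article, not live output.
sample_ps_output() {
    cat <<'EOF'
4313969 vmm0:vm-name COSTOP NONE 0-63
4313971 vmm1:vm-name WAIT SCSI 0-63
4313972 4313957 vmx-vthread-5:vm-name WAIT UFUTEX 0-63 /bin/vmx
4314210 4313957 vmx-mks:vm-name WAIT UPOL 0-63 /bin/vmx
4314212 4313957 vmx-svga:vm-name WAIT SEMA 0-63 /bin/vmx
4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx
EOF
}

# Hypothetical helper: print "<world-id> <thread> <state> <wait-type>" for
# every thread that is co-stopped (COSTOP) or blocked on SCSI (WAIT SCSI).
# Fields are located by value, not by fixed column, because the vmm lines
# in the sample have one column fewer than the vmx lines.
find_blocked_threads() {
    awk '{
        name = ""; state = ""; wait = ""
        for (i = 1; i <= NF; i++) {
            # First field that looks like a vmm/vmx thread name.
            if (name == "" && $i ~ /^vm[mx]/) {
                name = $i
                sub(/:.*/, "", name)   # drop the ":vm-name" suffix
            }
            if ($i == "COSTOP" || ($i == "WAIT" && $(i + 1) == "SCSI")) {
                state = $i
                wait = $(i + 1)
            }
        }
        if (state != "") print $1, name, state, wait
    }'
}

# Demonstration against the sample lines.
sample_ps_output | find_blocked_threads
```

On a host, the same filter could be fed from the command in this article, for example ps -s | grep <vm-name> | find_blocked_threads. A vmm or vmx-vcpu thread in WAIT SCSI alongside others in COSTOP matches the pattern described here.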

The following error may appear:

Unable to connect to the MKS: Error connecting to /bin/vmx process.

Other symptoms:

Virtual machines are unreachable over the network
Virtual machines may report an invalid state
Virtual machines are unresponsive

Environment

VMware ESXi Server 7.x.
VMware ESXi Server 8.x.

Cause

A virtual machine can be unresponsive due to:

Taking quiesced snapshots or using a custom quiescing script
Heavy I/O load on the ESXi hosts
Storage performance issues at the device, storage pool, and/or LUN level
One of the Virtual Machine Monitor (VMM) threads is blocked on a VSCSI call while the other VMM threads are co-stopped, waiting for the blocked thread to make progress

Resolution

Workaround

Caution: Before proceeding, ensure that no snapshot consolidation tasks and no backups are running on the affected virtual machines.

To recover the virtual machine from its locked state:

Find the process list for the virtual machine and check the cartel ID:

$ ps -s | grep <vm-name>

Note: Refer to the ps -s output mentioned in the Issue/Introduction section of this article.

Find the vmx-vcpu thread that is waiting on a SCSI event (WAIT SCSI).

Note: The number in the second column of the output is the cartel ID.

Run the following command to resume the stopped process:

$ kill -18 <cartel-ID>
After completing the above steps, the virtual machine may need to be reloaded. For more information, see Reloading a vmx file without removing the virtual machine from inventory.
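The lookup in the steps above can be sketched as a small dry-run script. This is an illustration only, assuming the ps -s column layout shown in the Issue/Introduction section; the names SAMPLE, cartel_ids_waiting_on_scsi, and print_resume_command are hypothetical. It prints the kill command for review instead of running it.

```shell
#!/bin/sh
# Sample input: vmx-vcpu lines quoted in this article, not live output.
SAMPLE='4314214 4313957 vmx-vcpu-0:vm-name COSTOP NONE 0-63 /bin/vmx
4314215 4313957 vmx-vcpu-1:vm-name WAIT SCSI 0-63 /bin/vmx'

# Hypothetical helper: print the cartel ID (second column) of every
# vmx-vcpu thread whose state is WAIT and whose wait type is SCSI.
# sort -u collapses duplicates when several vCPUs share one cartel.
cartel_ids_waiting_on_scsi() {
    awk '$3 ~ /^vmx-vcpu-/ && $4 == "WAIT" && $5 == "SCSI" { print $2 }' | sort -u
}

# Hypothetical helper: dry run only. It echoes the command from the
# workaround so it can be reviewed; it never sends a signal itself.
print_resume_command() {
    while read -r cartel_id; do
        echo "kill -18 $cartel_id"
    done
}

# Demonstration against the sample lines.
printf '%s\n' "$SAMPLE" | cartel_ids_waiting_on_scsi | print_resume_command
```

Printing rather than executing keeps the Caution in this article in force: the command should be run by hand only after confirming that no snapshot consolidation or backup is in progress.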
