Virtual machines might become unresponsive due to a rare deadlock issue in a VMFS6 volume

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

Virtual machines that use thin-provisioned VMDK files on a VMFS6 datastore randomly become unresponsive.
The /var/log/vmkernel.log file is flooded with "processing reset for handle" messages that repeat indefinitely for the same VSCSI handle:

2023-04-05T05:01:26.653Z cpu57:8916482)VSCSI: 2973: handle 38295998585421404(GID:48732)(vscsi0:0):Added handle (refCnt = 3) to vscsiResetHandleList vscsiResetHandleCount = 1
2023-04-05T05:01:26.653Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192707
2023-04-05T05:01:26.653Z cpu14:2097732)VSCSI: 3335: handle 38295998585421404(GID:48732)(vscsi0:0):Reset [Retries: 0/0] from (vmm0:SQLVM1)
2023-04-05T05:01:27.157Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:01:27.659Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:01:28.161Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:01:28.663Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:01:29.165Z cpu14:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:01:29.655Z cpu57:8916482)WARNING: VSCSI: 3967: handle 38295998585421404(GID:48732)(vscsi0:0):WaitForCIF: Issuing reset; number of CIF:4
2023-04-05T05:01:29.655Z cpu57:8916482)WARNING: VSCSI: 2986: handle 38295998585421404(GID:48732)(vscsi0:0):Ignoring double reset
<snip>
2023-04-05T05:08:56.864Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:57.367Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:57.840Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:57.840Z cpu3:2097732)VSCSI: 3335: handle 38295998585421404(GID:48732)(vscsi0:0):Reset [Retries: 15/0] from (vmm0:SQLVM1)
2023-04-05T05:08:58.343Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:58.845Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:59.347Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
2023-04-05T05:08:59.847Z cpu3:2097732)VSCSI: 3226: handle 38295998585421404(GID:48732)(vscsi0:0):processing reset for handle ... state 1381192706
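
To check a log for this pattern, the repeated messages can be counted per handle. The following is a minimal sketch, not an official diagnostic tool: the regular expression is derived only from the log lines shown above, and the function names and the threshold of 10 are illustrative assumptions.

```python
import re
import sys
from collections import Counter

# Pattern derived from the "processing reset for handle" lines in the excerpt above.
# Other ESXi builds may format these lines differently (assumption).
RESET_RE = re.compile(
    r"VSCSI: \d+: handle (?P<handle>\d+)\(GID:\d+\)\((?P<dev>vscsi\d+:\d+)\):"
    r"processing reset for handle"
)


def count_reset_messages(lines):
    """Count 'processing reset for handle' lines per (handle, virtual device)."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line)
        if match:
            counts[(match.group("handle"), match.group("dev"))] += 1
    return counts


def suspicious_handles(lines, threshold=10):
    """Return handles whose reset message count reaches the threshold.

    The default threshold is an arbitrary illustrative value, not a documented limit.
    """
    return {
        key: count
        for key, count in count_reset_messages(lines).items()
        if count >= threshold
    }


if __name__ == "__main__":
    # Usage: python3 scan_resets.py /var/log/vmkernel.log
    with open(sys.argv[1], errors="replace") as log_file:
        for (handle, dev), count in suspicious_handles(log_file).items():
            print(f"handle {handle} ({dev}): {count} reset messages")
```

A high count for a single handle only indicates that the log matches the symptom described here; it does not by itself confirm the root cause.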

Environment

VMware vSphere ESXi 7.x
VMware vSphere 7.0.x

Cause

In rare cases, if a write I/O request runs in parallel with an unmap operation triggered by the guest OS on a thin-provisioned VM, a deadlock might occur in a VMFS6 volume. As a result, the virtual machine may become unresponsive.

Resolution

This is a known issue that is resolved in ESXi 7.0 U3f. Please see the release notes:

https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vsphere-esxi-70u3f-release-notes.html

Workaround:
To work around this issue, inflate the VM's thin disks (convert them to thick). The guest OS then no longer issues UNMAP commands against those disks, so write I/Os cannot race with UNMAP operations.
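
One way to inflate a thin disk from the ESXi shell is the vmkfstools --inflatedisk option, which converts a thin disk to eager-zeroed thick while preserving its data. The sketch below only builds and prints the command so it can be reviewed before it is run. The helper name and the datastore path are hypothetical, and it assumes the VM is powered off and the datastore has enough free space for the fully provisioned disk.

```python
import shlex


def build_inflate_command(vmdk_path):
    """Return the argument list for inflating a thin VMDK with vmkfstools.

    vmdk_path must point to the disk descriptor file (name.vmdk),
    not to the name-flat.vmdk data file.
    """
    if not vmdk_path.endswith(".vmdk"):
        raise ValueError("expected a path to a .vmdk descriptor file")
    if vmdk_path.endswith("-flat.vmdk"):
        raise ValueError("pass the descriptor file, not the -flat.vmdk data file")
    return ["vmkfstools", "--inflatedisk", vmdk_path]


if __name__ == "__main__":
    # Hypothetical path; replace with the actual datastore and VM folder.
    command = build_inflate_command("/vmfs/volumes/datastore1/SQLVM1/SQLVM1.vmdk")
    # Print only; run the command manually after reviewing it.
    print(shlex.join(command))
```

The same conversion can also be done from the vSphere Client; the command form is shown only because it is easy to repeat for several disks.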