Poor VM performance with "Lock was not free" errors

Article ID: 388386

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

  • One or more VMs exhibit poor performance or may become unresponsive
  • I/O errors are reported against some VM files
  • Logs report frequent "Lock was not free" warnings against the VMFS datastore on which the VM(s) reside
  • Hostd may report that certain threads related to VM file access are slow/stuck for extended periods, e.g.:

Hostd[2099168] [Originator@6876 sub=IoTracker] In thread 2099146, fopen("/vmfs/volumes/########-########-####-############/<VM_name>/<VM_name>.vmx") took over 267 sec.
Hostd[2099144] [Originator@6876 sub=IoTracker] In thread 2099176, fopen("/vmfs/volumes/########-########-####-############/<VM_name>/<VM_name>.vmx") took over 1885 sec.
Hostd[2099179] [Originator@6876 sub=IoTracker] In thread 2099167, fopen("/vmfs/volumes/########-########-####-############/<VM_name>/<VM_name>.vmx") took over 1595 sec.
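
These slow-operation warnings can be pulled directly from the hostd log. A minimal sketch, assuming the default log location of /var/log/hostd.log (adjust the path if logging is redirected):

# list hostd IoTracker warnings about file operations that took an unusually long time
grep "IoTracker" /var/log/hostd.log | grep "took over"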

Environment

VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

Cause

These symptoms can arise when there is a high degree of lock contention on the VMFS datastore.

Note that a degree of lock contention is normal and expected on ESXi hosts accessing shared storage. Lock contention becomes an issue when a host fails to acquire a specific datastore lock for an extended period because the lock is being held by another host.

Identify lock contention: 
When a host accesses a lock at a specific offset, the lock version is incremented. If a lock shows the same version across repeated attempts to access it, that lock has been held and not released for the duration of those attempts.

Logging of the following type will be seen in vmkernel.log:

<timestamp> cpu3:2315260)DLX: 4333: vol '<datastoreName>', lock at 248979456: [Req mode 1] Checking liveness:
<timestamp> cpu3:2315260)[type 10c00001 offset 248979456 v 137952457, hb offset 3440640
gen 371, mode 1, owner ########-########-####-############ mtime 50755437
num 0 gblnum 0 gblgen 0 gblbrk 0]


<timestamp> cpu0:4227286)DLX: 4985: vol '<datastoreName>', lock at 20955136: [Req mode: 1] Not free:
<timestamp> cpu0:4227286)[type 10c00002 offset 20955136 v 1672, hb offset 3735552
gen 8065, mode 1, owner #######-########-####-############ mtime 34117143
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 3735552

 

For example, the following command lists the number of attempts to access each lock (first column), the lock offset (second column) and the lock version (third column):
grep -Ei -A1 "checking liveness|not free" vmkernel.log | grep offset | awk '{print $7,$9}'|sort | uniq -c | sort -r | less
     57 240992438 76,
     49 240955673 108,
     47 160571391 46,
     43 187776960 59,
     32 190488438 99,
     ...

Note: logging may vary slightly between ESXi versions.
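
Where the files under contention are known, the lock record for an individual file can also be dumped for confirmation. A minimal sketch using vmkfstools; the file path is only an example, and on some builds the detailed lock record is written to vmkernel.log rather than printed to the terminal:

# dump the lock record (type, mode, owner UUID) for a specific VM file
vmkfstools -D /vmfs/volumes/<datastoreName>/<VM_name>/<VM_name>.vmx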

Resolution

Identify whether one host is predominantly the holder of the locks under contention, e.g.:

grep -Ei -A2 "checking liveness|not free" vmkernel.* | grep owner | awk '{print $9}'|sort | uniq -c | sort -r
   4773 <ESXi host UUID_1>
      1 <ESXi host UUID_2>
      1 <ESXi host UUID_3>
      1 <ESXi host UUID_4>
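
To map an owner UUID in this output to a particular host, it can be compared against the UUID each ESXi host reports for itself. A sketch, assuming shell access to the hosts:

# run on each ESXi host and compare the result with the owner values above
esxcli system uuid get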

Place this host into maintenance mode and reboot it.
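
This is normally done from the vSphere Client so that DRS can evacuate running VMs first. As a sketch of the equivalent CLI steps, assuming all VMs have already been migrated off the host or powered down:

# enter maintenance mode, then reboot the host
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "Clearing VMFS lock contention"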

If no single host is causing the issue:

  • Perform general storage checks (configuration, drivers/firmware, connectivity issues, errors)
  • Check for high utilization (> 70%) on impacted datastores, as this can increase lock contention. Consider storage migration of VMs to reduce utilization and mitigate the issue (a utilization check sketch follows this list).
  • Where the contention relates to a number of lock offsets that are repeatedly a source of contention, consider storage migration of VMs off the impacted datastores and hosts as a possible mitigation.

    A good understanding of the pattern of lock contention in the specific case is important for any deeper investigation of the issue.
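
As referenced above, a quick way to review the current utilization of the datastores mounted on a host is a sketch like the following; utilization per volume is (Size - Free) / Size:

# list mounted filesystems with their size and free space (VMFS datastores show as type VMFS-5/VMFS-6)
esxcli storage filesystem list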