Investigate Lock Contention on ESXi hosts

Article ID: 384496

Products

VMware vSphere ESX 6.x
VMware vSphere ESX 7.x
VMware vSphere ESX 8.x
VMware vSphere ESXi

Issue/Introduction

The ESXi vmkernel.log reports entries similar to the following, indicating a degree of lock contention:

<timestamp> cpu3:2315260)DLX: 4333: vol '<datastoreName>', lock at 248979456: [Req mode 1] Checking liveness:
<timestamp> cpu3:2315260)[type 10c00001 offset 248979456 v 137952457, hb offset 3440640
gen 371, mode 1, owner ########-########-####-############ mtime 50755437
num 0 gblnum 0 gblgen 0 gblbrk 0]


<timestamp> cpu0:4227286)DLX: 4985: vol '<datastoreName>', lock at 20955136: [Req mode: 1] Not free:
<timestamp> cpu0:4227286)[type 10c00002 offset 20955136 v 1672, hb offset 3735552
gen 8065, mode 1, owner #######-########-####-############ mtime 34117143
num 0 gblnum 0 gblgen 0 gblbrk 0] alloc owner 3735552

It is unclear whether this logging indicates that the host is suffering from significant lock contention that impacts performance.

Environment

VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

Resolution

A degree of lock contention is normal and expected on ESXi hosts accessing shared storage. 

Lock contention affects performance and operations when an ESXi host is unable to acquire a specific lock for an extended period of time.

This can be checked by confirming whether a host repeatedly fails to acquire the same lock (the same offset) at the same lock version, for example:

<timestamp> cpu3:2315260)DLX: 4333: vol '<datastoreName>', lock at 248979456: [Req mode 1] Checking liveness:
<timestamp> cpu3:2315260)[type 10c00001 offset 248979456 v 137952457, hb offset 3440640
gen 371, mode 1, owner ########-########-####-############ mtime 50755437
num 0 gblnum 0 gblgen 0 gblbrk 0]

(The lock version, shown after "v" in the lock dump, increments each time a host successfully acquires the lock at that offset. An unchanged version for the same offset therefore indicates repeated attempts to acquire the same lock.)

To confirm the pattern, run:

grep -Ei -A1 "checking liveness|not free" vmkernel.all | grep offset | awk '{print $5, $7}' | sort | uniq -c | sort -rn | less

Partial sample output:

3 18882560 135532,
3 18882560 135426,
3 18882560 135328,
3 18882560 135230,
3 18882560 135022,
3 18882560 134924,
2 94699520 402,
2 94658560 142,
...

The output above shows that the ESXi host made 3 attempts to access the lock at offset 18882560 at version 135532, each time finding it held by another host. The same pattern of 3 attempts repeats for earlier versions of the same lock (134924 through 135426); because the version advanced between each group, other hosts were acquiring and releasing the lock in the meantime. This pattern indicates a moderate level of lock contention.

If the count for a single offset and version is large, another host has been holding the lock for a long time (many minutes or hours) while processing a single I/O operation. In that case, the lock is stale.

(Note that logging may vary between ESXi versions, so the column numbers selected in the awk command may need to be adjusted to capture the lock offset and version.)
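Because column positions can shift between ESXi releases, the counts can instead be gathered by matching the "offset <N> v <M>," text itself. The following is a minimal POSIX shell and awk sketch, not a VMware-provided tool. The function name `lock_attempt_summary` is hypothetical, and the parsing assumes (as in the sample entries above) that the line following each "Checking liveness" or "Not free" line carries the lock dump and begins with the log timestamp. For each lock offset and version it prints the number of attempts plus the first and last timestamps seen, which gives a rough idea of how long the host waited on that lock version.

```sh
# Hypothetical helper (not a VMware tool): summarise contention entries per
# lock offset and version without relying on fixed awk column numbers.
# Assumptions, based on the sample entries in this article:
#   * the line after "Checking liveness"/"Not free" holds "offset <N> v <M>,"
#   * the first field of that line is the log timestamp.
# Output columns: attempts, offset, version, first seen, last seen.
lock_attempt_summary() {
    awk '
        # Flag the next line: it carries the lock dump for this entry.
        tolower($0) ~ /checking liveness|not free/ { want = 1; next }
        want {
            want = 0
            for (i = 1; i < NF - 2; i++) {
                # Match the first "offset <N> v <M>," sequence on the line.
                if ($i == "offset" && $(i + 2) == "v") {
                    ver = $(i + 3)
                    sub(/,$/, "", ver)
                    key = $(i + 1) " " ver
                    if (!(key in count)) first[key] = $1
                    last[key] = $1
                    count[key]++
                    break
                }
            }
        }
        END { for (key in count) print count[key], key, first[key], last[key] }
    ' | sort -rn
}
```

Example usage, after pasting the function into the shell:

lock_attempt_summary < vmkernel.all | less

A high attempt count combined with a wide gap between the first and last timestamps for the same offset and version matches the stale-lock pattern described above.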

If a specific lock offset is identified as having been inaccessible to a host for an extended period of time, the host holding the lock can be identified from the owner UUID. The last segment of the UUID in the log message corresponds to the MAC address of the host holding the lock. For example, if the UUID was "###########-#######-####-#############", the MAC address would be "#############". The host will typically need to be rebooted to allow other hosts to access the lock.
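The MAC-address portion can be pulled out of an owner UUID with a short shell function. This is a sketch: `owner_mac` is a hypothetical helper name, and it assumes the UUID has four dash-separated fields with 12 hex digits in the last one, matching the layout of the owner values logged above.

```sh
# Hypothetical helper: print the MAC-address portion of a lock owner UUID.
# Assumes the last dash-separated field holds 12 hex digits.
owner_mac() {
    # Keep the last field, put a colon after every two characters,
    # then drop the trailing colon.
    printf '%s\n' "$1" | awk -F- '{ print $NF }' | sed 's/../&:/g; s/:$//'
}
```

For example, with the made-up UUID below, the function prints 00:50:56:ab:cd:ef:

owner_mac 5e0f1a2b-3c4d5e6f-7a8b-005056abcdef

The result can then be compared with the MAC Address column of esxcli network nic list on each host that mounts the datastore, assuming the host's NICs have not been replaced since the UUID was generated.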

 

To determine whether a particular host holds locks more frequently than others (and may have an issue), run:

grep -Ei -A2 "checking liveness|not free" vmkernel.all | grep owner | awk '{print $6}' | sort | uniq -c | sort -rn
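If the owner UUID does not sit at a fixed column, a sketch along the following lines finds it by looking for the dash-separated token that follows the word "owner" within the two lines after each contention entry (mirroring -A2 in the command above). The function name `lock_owner_summary` is hypothetical, and the parsing assumes the lock dump layout shown in the sample entries in this article.

```sh
# Hypothetical helper: count how often each owner UUID appears in
# "Checking liveness" / "Not free" entries read from standard input.
lock_owner_summary() {
    awk '
        # The owner UUID appears within the two lines after the match.
        tolower($0) ~ /checking liveness|not free/ { left = 2; next }
        left > 0 {
            left--
            for (i = 1; i < NF; i++) {
                # Require a dash so numeric values such as "alloc owner 3735552" are ignored.
                if ($i == "owner" && $(i + 1) ~ /-/) {
                    print $(i + 1)
                    left = 0
                    break
                }
            }
        }
    ' | sort | uniq -c | sort -rn
}
```

Example usage:

lock_owner_summary < vmkernel.all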

 

To determine how many such lock contention entries are generated per datastore, run:

grep -Ei "checking liveness|not free" vmkernel.all | awk '{print $5}' | sort | uniq -c | sort -rn
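Datastore names can contain spaces, in which case $5 captures only part of the name. A sketch that extracts the quoted name after "vol" instead, assuming the name is enclosed in single quotes as in the sample entries (the function name `contention_per_datastore` is hypothetical):

```sh
# Hypothetical helper: count contention entries per datastore, reading
# vmkernel log lines from standard input.
contention_per_datastore() {
    # Keep only contention entries, then print the text between the single
    # quotes after "vol" so names containing spaces stay intact.
    grep -Ei 'checking liveness|not free' |
        sed -n "s/.*vol '\([^']*\)'.*/\1/p" |
        sort | uniq -c | sort -rn
}
```

Example usage:

contention_per_datastore < vmkernel.all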

(Again, adjust the column numbers in the awk commands as necessary.)



Additional Information

If the vmkernel log reports "Lock Rank Violation Received", a software deadlock has arisen in accessing a specific lock, beyond normal lock contention. Further investigation is required.