Investigating virtual disk file locks on vSAN

Article ID: 326800


Products

VMware vSAN

Issue/Introduction

Symptoms:

  • VMs fail to power on.
  • Snapshots fail to delete or consolidate.
  • VMs fail to clone or migrate with vMotion.

On a VMFS datastore, file locks are usually checked on the -flat or -delta virtual disk file. These files do not exist on vSAN because it is an object-based storage system. This article describes how to check for locks on vSAN virtual disk objects.
File locks can cause a range of failures. For example, a VM power-on or a snapshot consolidation may fail.

Environment

VMware vSAN (All Versions)

Cause

vSAN has a specific object type, vdisk, for virtual disks. These objects are not stored in the VM namespace directory alongside the VM's configuration files.

Resolution

Check whether any backup proxy servers are in use. If so, check whether the affected disk is still mounted to a proxy server. If it is, remove the disk from the proxy server, making sure that "Delete from disk" is NOT selected.

Note: There may be more than one proxy server in use. Make sure to check all proxy servers.
vSAN uses .lck files. Each .lck file is named after the UUID of the vSAN object it represents.

To check the descriptor, change directory into the VM namespace:

     cd /vmfs/volumes/vsanDatastore/<VM_Namespace>

Run the command below to pull the relevant UUIDs:

     grep RW VMDiskName.vmdk

Sample output:

     # Extent description
     RW 209715200 VMFS "vsan://########-####-####-####-########31f0"

Note: If the command fails with a "device or resource busy" error, SSH to the host where the VM is registered and work from that host.

The UUID “########-####-####-####-########31f0” is the vSAN object representing the vdisk for that descriptor.
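The mapping from descriptor to lock file name can be sketched as a small shell helper. This is a minimal illustration, not part of the documented procedure: the function name vsan_lock_names is hypothetical, the UUID in the demo is made up, and it assumes the descriptor references the object in the form "vsan://<uuid>" as in the sample output above.

```shell
#!/bin/sh
# Hypothetical helper: read a descriptor on stdin and print the expected
# ".<uuid>.lck" file name for every RW extent that points at a vSAN object.
vsan_lock_names() {
    # Match RW extent lines, capture the text between "vsan://" and the
    # closing quote, and wrap it as .<uuid>.lck
    sed -n 's|^[[:space:]]*RW .*"vsan://\([^"]*\)".*|.\1.lck|p'
}

# Demo with a made-up UUID (not taken from a real cluster).
printf '%s\n' \
    '# Extent description' \
    'RW 209715200 VMFS "vsan://01234567-89ab-cdef-0123-456789ab31f0"' |
    vsan_lock_names
# prints .01234567-89ab-cdef-0123-456789ab31f0.lck

# On a host, from inside the VM namespace directory, it could be used as:
#   vsan_lock_names < VMDiskName.vmdk
```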
The following command shows all .<uuid>.lck files within the vSAN namespace directory:

     # ls -lah .*.lck

Sample output:

     -rw------- 1 root root 0 Jul 13 2017 .########-####-####-####-########31f0.lck

There may also be non-hidden lock files. List them with a similar command:

     # ls -lah *.lck

 
Execute the following command to show the lock details for this vSAN object:
 
     vmfsfilelockinfo -p .########-####-####-####-########31f0.lck 

Sample output:

     vmfsfilelockinfo Version 2.0
     Looking for lock owners on ".########-####-####-####-########31f0.lck"
     "<VMname>.vswp.lck" is locked in Exclusive mode by host having mac address ['xx:xx:xx:xx:xx:xx']
     Trying to make use of Fault Domain Manager
     ----------------------------------------------------------------------
     Found 6 ESX hosts using Fault Domain Manager.
     ----------------------------------------------------------------------
     Searching on Host esxi1
     Searching on Host esxi3
     Searching on Host esxi4
     Searching on Host esxi2
     Searching on Host esxi6
     Searching on Host esxi5
     MAC Address : ##:##:##:##:##:##

     Host owning the lock on file is esxi5, lockMode : Exclusive
     Total time taken : 0.11339905299246311 seconds.

Sample output if no lock is found:

     vmfsfilelockinfo Version 2.0
     Looking for lock owners on ".########-####-####-####-########31f0.lck"
     ".########-####-####-####-########31f0.lck" is not locked by any ESX host and is Free
     Total time taken : 0.037906300276517868 seconds.

Alternatively, run the command vmkfstools -D against this file. This will show the lock details for this vSAN object as well.
     # vmkfstools -D .########-####-####-####-########31f0.lck

Sample output:

     Lock [type 10c00001 offset 152799232 v 830, hb offset 3969024
     gen 215, mode 1, owner ########-######dc-07eb-########2052 mtime 1107249
     num 0 gblnum 0 gblgen 0 gblbrk 0]
     Addr <4, 354, 1>, gen 3, links 1, type reg, flags 0, uid 0, gid 0, mode 600
     len 0, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 4305, bs 8192

The last segment of the owner field (the 12 hexadecimal digits ending in 2052 in this example) is the MAC address of the management VMkernel port. It should correspond to a host in the vSAN cluster.
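To compare the owner value with the MAC addresses of the hosts, it helps to rewrite its last segment in colon-separated form. The sketch below does only that string conversion. The function name owner_to_mac and the owner value in the demo are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert the last segment of a lock owner value
# (12 hex digits) into colon-separated MAC address notation.
owner_to_mac() {
    # 1) drop everything up to the last "-"
    # 2) append ":" after every pair of characters
    # 3) remove the trailing ":"
    printf '%s\n' "$1" | sed 's/.*-//; s/../&:/g; s/:$//'
}

# Demo with a made-up owner value.
owner_to_mac '01234567-89abcddc-07eb-005056a12052'
# prints 00:50:56:a1:20:52
```

The result can then be matched against the VMkernel interface MAC addresses on each host, for example those listed by esxcfg-vmknic -l.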

Note: During the life cycle of a powered-on virtual machine, several of its files transition between various legitimate lock states. The lock mode indicates the type of lock that is on the file. The lock modes are:
mode 0 = no lock
mode 1 = exclusive lock (the .vmx file of a powered-on virtual machine, the disk currently in use (flat or delta), *vswp, and so on)
mode 2 = read-only lock (for example, on the ..-flat.vmdk of a running virtual machine with snapshots)
mode 3 = multi-writer lock (for example, used for MSCS cluster disks or FT VMs)
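The mode lookup can be sketched in shell as follows. Both function names are hypothetical. The parser assumes the Lock entry has the layout shown in the sample vmkfstools -D output ("gen N, mode N, owner ..."), and the values in the demo are made up. Note that the Addr line also ends in a "mode" value (the file permissions, 600 in the sample), which the pattern deliberately skips.

```shell
#!/bin/sh
# Hypothetical helper: map a lock mode number to the description
# given in the list above.
lock_mode_name() {
    case "$1" in
        0) echo "no lock" ;;
        1) echo "exclusive lock" ;;
        2) echo "read-only lock" ;;
        3) echo "multi-writer lock" ;;
        *) echo "mode $1 is not covered by this list" ;;
    esac
}

# Hypothetical helper: read vmkfstools -D output on stdin and print the
# lock mode. Only the "gen N, mode N, owner" part of the Lock entry
# matches; the permission mode on the Addr line does not.
lock_mode_from_output() {
    sed -n 's/.*gen [0-9]*, mode \([0-9]*\), owner .*/\1/p'
}

# Demo with made-up values in the same layout as the sample output.
mode=$(lock_mode_from_output <<'EOF'
Lock [type 10c00001 offset 152799232 v 830, hb offset 3969024
gen 215, mode 1, owner 01234567-89abcddc-07eb-005056a12052 mtime 1107249
num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 354, 1>, gen 3, links 1, type reg, flags 0, uid 0, gid 0, mode 600
EOF
)
lock_mode_name "$mode"
# prints exclusive lock
```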

 
 
SSH to the host that owns the lock and restart the management services hostd and vpxa with the following command:

/etc/init.d/hostd restart && /etc/init.d/vpxa restart

If the lock is still present then run the below command:
 
lsof |grep <vmname> && ps|grep <vmname>
 
Example of command and output:
 
[root@esxi4:~] lsof |grep cent7_2 && ps|grep cent7_2
7565528     vmx                   FILE                       43   /vmfs/volumes/vsan:########-########-####-####-####-########5523/########-####-####-####-########81e8/cent7_2.vmx.lck
7565528     vmx                   FILE                       44   /vmfs/volumes/vsan:########-########-####-####-####-########5523/########-####-####-####-########81e8/cent7_2.vmx
7565528     vmx                   FILE                       45   /vmfs/volumes/vsan:########-########-####-####-####-########5523/########-####-####-####-########81e8/cent7_2.vmx~
7565528     vmx                   FILE                       82   /vmfs/volumes/vsan:########-########-####-####-####-########5523/########-####-####-####-########81e8/cent7_2.nvram
7565529  0        vmm0:cent7_2
7565533  0        vmm1:cent7_2
7565535  7565528  vmx-filtPoll:cent7_2
7565536  7565528  vmx-mks:cent7_2
7565537  7565528  vmx-svga:cent7_2
7565538  7565528  vmx-vcpu-0:cent7_2
7565540  7565528  vmx-vcpu-1:cent7_2
 

The number in the first column of the lsof output (7565528 in this example) is the world process ID of the vmx process. The same number appears in the second column of the ps output for the VM's vmx worlds. To kill this process, run kill <PID>.
 
Warning: Only run kill on a host where the VM is NOT registered.

Note: If the VM is powered off, there should be no open files (lsof) or active processes (ps) for the VM. Open files or active processes should exist only on the host where the VM is registered, and only while the VM is powered on.
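Picking the process ID out of the lsof listing can be sketched as below. The function name vm_file_owners is hypothetical, the demo paths are shortened placeholders, and the sketch assumes awk is available in the host shell and that the ID is the first column, as in the example output above.

```shell
#!/bin/sh
# Hypothetical helper: read lsof output on stdin and print each distinct
# process ID (first column) that has a file open whose line mentions the
# VM name given as $1.
vm_file_owners() {
    awk -v vm="$1" 'index($0, vm) && !seen[$1]++ { print $1 }'
}

# Demo with placeholder paths in the same column layout as the example.
vm_file_owners cent7_2 <<'EOF'
7565528     vmx     FILE     43   /vmfs/volumes/vsan:example/example/cent7_2.vmx.lck
7565528     vmx     FILE     44   /vmfs/volumes/vsan:example/example/cent7_2.vmx
7565999     vmx     FILE     12   /vmfs/volumes/vsan:example/example/other_vm.vmx
EOF
# prints 7565528

# On a host it could be used as:
#   lsof | vm_file_owners <vmname>
#   ps | grep <vmname>
```

Running the lsof and ps checks as two separate commands is a deliberate choice in this sketch. In a chain of the form "a && b", b runs only when a succeeds, and grep exits with a non-zero status when it matches nothing, so the ps check would be skipped whenever lsof shows no open files for the VM.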

If neither lock command reports a lock, run the following command on every host in the cluster:
 
lsof |grep <vmname> && ps|grep <vmname>
 
This step reveals whether processes for the VM are running on more than one host. If a host has a hung process related to the VM, kill that process on that host.
 
Warning: Only kill the process on hosts where the VM is NOT registered. This is critical if the VM is powered on.

If the locked-file errors persist, vmfsfilelockinfo -p and vmkfstools -D find no locks, and lsof |grep <vmname> && ps|grep <vmname> finds no active process for the VM on any host, then a rolling reboot of the cluster is required to clear the lock.

To check every VM file and vSAN object lock file at once, listing which files are locked and which host holds each lock, run the following command in the VM directory:

for file in *; do echo ${file}; vmfsfilelockinfo -p ${file} |grep -i mode; done
 
Sample output:

Test-3f9d789c.hlog
Test-ec315dde.vswp
Test-ec315dde.vswp.lck
"Test-ec315dde.vswp.lck" is locked in Exclusive mode by host having mac address ['00:##:56:##:11:##']
Host owning the lock on file is <Hostname>, lockMode : Exclusive
Test.nvram
"Test.nvram" is locked in Exclusive mode by host having mac address ['00:##:56:##:11:##']
Host owning the lock on file is <Hostname>, lockMode : Exclusive
Test.vmdk
Test.vmsd
Test.vmx


 
Typically the lock owner shown in the output is the host where the VM is registered. If a different host is listed, note the name of that host.
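Spotting a lock held by an unexpected host in a long listing can be sketched as a filter over the loop output. The function name foreign_locks and the host names in the demo are hypothetical. The sketch assumes the output layout shown in the sample above (a file name line, optionally followed by the "is locked" and "Host owning the lock" lines) and that awk is available in the host shell.

```shell
#!/bin/sh
# Hypothetical helper: read the output of the lock-check loop on stdin and
# print "<file><TAB><host>" for every lock held by a host other than $1.
foreign_locks() {
    awk -v own="$1" '
        /^Host owning the lock on file is / {
            host = $0
            sub(/^Host owning the lock on file is /, "", host)
            sub(/, lockMode.*$/, "", host)
            if (host != own) print file "\t" host
            next
        }
        / is locked in .* mode by host / { next }   # detail line, skip
        { file = $0 }                               # otherwise a file name
    '
}

# Demo with made-up host names; esxi5 is treated as the expected owner.
foreign_locks esxi5 <<'EOF'
Test-ec315dde.vswp.lck
"Test-ec315dde.vswp.lck" is locked in Exclusive mode by host having mac address ['00:00:00:00:00:05']
Host owning the lock on file is esxi5, lockMode : Exclusive
Test.nvram
"Test.nvram" is locked in Exclusive mode by host having mac address ['00:00:00:00:00:02']
Host owning the lock on file is esxi2, lockMode : Exclusive
Test.vmx
EOF
# prints Test.nvram<TAB>esxi2
```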

To check all .<uuid>.lck files run the below command:

for file in .*lck; do echo ${file}; vmfsfilelockinfo -p ${file} |grep -i mode; done
 
To check all the files for VMs that have spaces in the name run the below command:

for file in *; do echo "${file}"; vmfsfilelockinfo -p "${file}" |grep -i mode; done
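The three loops can also be combined into one pass over both regular files and hidden .<uuid>.lck files, with quoting so that names containing spaces are handled. This is a sketch only: the function name check_dir_locks is hypothetical, and it relies solely on the vmfsfilelockinfo -p invocation already used in this article.

```shell
#!/bin/sh
# Hypothetical helper: check every regular file and every hidden .lck file
# in the current directory, printing each name followed by any lock lines.
check_dir_locks() {
    for f in * .*.lck; do
        # A pattern that matches nothing is left as a literal string; skip it.
        [ -e "$f" ] || continue
        echo "$f"
        # grep exits non-zero when no lock line is printed; ignore that so
        # the loop result does not look like a failure.
        vmfsfilelockinfo -p "$f" | grep -i mode || true
    done
}

# On a host, from inside the VM namespace directory:
#   check_dir_locks
```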

Additional Information

See the following KB articles:

  • Committing snapshots when there are no snapshot entries in the Snapshot Manager
  • Investigating virtual machine file locks on ESXi
  • Restarting the Management agents in ESXi

Note: A VM can shut down after consolidation if a lock is obtained during the switchover period between the initial disk and the base disk. See "Virtual Machine shuts down after a disk consolidation due to a locked file" for more details.