The snapshot consolidation task fails with the error "Unable to access file since it is locked. An error occurred while consolidating disks: Failed to lock the file".
The error recurs on every consolidation attempt.
VMware vSphere with snapshot-based backup solutions.
One or more snapshot disks in the impacted VM's snapshot chain are still locked by a process other than the VM's own vmx process.
In a snapshot-based backup solution, the backup proxy VM attaches (provisions) the snapshot disk in read-only mode to back up the VM. While the disk is attached, the ESXi host running the backup proxy VM places a read-only lock on the snapshot flat file.
If, for any reason, the backup job does not release this lock after the backup completes, the consolidation task fails and the snapshot chain continues to grow until it reaches the maximum of 255 redo logs.
The consolidation task succeeds only when no process other than the VM's own vmx process holds a lock on any of the snapshot files.
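Because the chain keeps growing while consolidation fails, it can help to see how close each disk is to the 255 limit. The following Python sketch is a rough heuristic, not a VMware tool: it assumes the default delta descriptor naming `<disk>-NNNNNN.vmdk` (as in `Test-VM-000255.vmdk`) and a single-branch snapshot tree, and simply counts delta descriptor files per base disk in a directory listing rather than reading the descriptors' parent links. The function names and the warning threshold are hypothetical.

```python
import re
from collections import Counter

# Redo log limit for a single branch of the snapshot tree.
MAX_REDO_LOGS = 255

# Matches delta descriptors such as "Test-VM-000255.vmdk" (assumed default naming).
# Data/tracking files like "Test-VM-000255-sesparse.vmdk" or "-ctk.vmdk" do not match.
DELTA_DESCRIPTOR = re.compile(r"^(?P<base>.+)-(?P<index>\d{6})\.vmdk$")


def count_delta_descriptors(filenames):
    """Return {base disk name: number of delta descriptors} for a directory listing."""
    counts = Counter()
    for name in filenames:
        match = DELTA_DESCRIPTOR.match(name)
        if match:
            counts[match.group("base")] += 1
    return dict(counts)


def disks_near_limit(filenames, threshold=MAX_REDO_LOGS - 5):
    """Return base disk names whose delta count is at or above the (arbitrary) threshold."""
    counts = count_delta_descriptors(filenames)
    return sorted(base for base, n in counts.items() if n >= threshold)
```

For example, `count_delta_descriptors(os.listdir("/vmfs/volumes/<datastore>/<VM directory>"))` would count the descriptors in a VM directory; the path is a placeholder.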
The VM's log file (vmware.log) contains events similar to the following:
YYYY-MM-DDThh:mm:ss.417Z In(05) vmx - SnapshotVMX_Consolidate: Starting online snapshot consolidate operation.
YYYY-MM-DDThh:mm:ss.541Z In(05) vmx - ConsolidateFillSnapDiskTransferArray: Item 0 source: /vmfs/volumes/vsan:<vsan Datastore UUID>/<VSAN OBJECT ID - VM HOME DIR>/Test-VM-000255.vmdk dest: /vmfs/volumes/vsan:<vsan Datastore UUID>/<VSAN OBJECT ID - VM HOME DIR>/Test-VM.vmdk. Cumulative size of redo logs (including meta-data): 6883860480.
YYYY-MM-DDThh:mm:ss.631Z In(05) vcpu-0 - [msg.disklib.numLinks.maxReached] This virtual machine has 255 or more redo logs in a single branch of its snapshot tree. The maximum supported limit has been reached, creating new snapshots will not be allowed. To create new snapshots, please delete old snapshots or consolidate the redo logs.
YYYY-MM-DDThh:mm:ss.745Z In(05) vcpu-0 - ConsolidateEnd: Snapshot consolidate complete: Failed to lock the file (5).
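When triaging several attempts, the relevant vmware.log lines can be extracted programmatically. This Python sketch assumes the line layout shown above (timestamp as the first whitespace-separated field, and the literal strings `ConsolidateEnd: Snapshot consolidate complete:` and `msg.disklib.numLinks.maxReached`); other ESXi versions may word these messages differently, and the function name is hypothetical.

```python
import re

# Final status line of a consolidation attempt, e.g.
# "<timestamp> In(05) vcpu-0 - ConsolidateEnd: Snapshot consolidate complete: Failed to lock the file (5)."
CONSOLIDATE_END = re.compile(
    r"^(?P<ts>\S+) .*ConsolidateEnd: Snapshot consolidate complete: (?P<result>.+)$"
)
# Message ID logged when a snapshot branch reaches the redo log limit.
MAX_LINKS_MSG = "msg.disklib.numLinks.maxReached"


def summarize_vmx_log(lines):
    """Collect consolidation results and flag lock failures and the redo log limit."""
    results = []
    max_reached = False
    for line in lines:
        line = line.rstrip("\n")
        if MAX_LINKS_MSG in line:
            max_reached = True
        match = CONSOLIDATE_END.match(line)
        if match:
            result = match.group("result")
            # (timestamp, status text, True if the failure is a file lock problem)
            results.append((match.group("ts"), result, "Failed to lock the file" in result))
    return {"results": results, "max_redo_logs_reached": max_reached}
```

Typical use is `with open("vmware.log") as f: summary = summarize_vmx_log(f)`.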
The vmkernel log contains entries similar to the following during the consolidation attempt:
YYYY-MM-DDThh:mm:ss.551Z cpu26:19100970 opID=74f54281)DLX: 2650: vol '<VSAN OBJECT ID - VM HOME DIR>', lock at 125706240: Lock type: 10C00001. Read Lock(s) held on a file on volume <VSAN OBJECT ID>. numHolders:1 gblNumHolders:0,
YYYY-MM-DDThh:mm:ss.551Z cpu26:19100970 opID=74f54281)[type 10c00001 offset 125706240 v 13942, hb offset 3215360
gen 263, mode 2, owner 00000000-00000000-0000-000000000000 mtime 11919512
num 1 gblnum 0 gblgen 0 gblbrk 0] alloc owner 0
YYYY-MM-DDThh:mm:ss.551Z cpu26:19100970 opID=74f54281)DLX: 2651: vol '<VSAN OBJECT ID - VM HOME DIR>', lock at 125706240: Lock type: 10C00001. owner(s) MAC: ##:##:##:##:##:##:
YYYY-MM-DDThh:mm:ss.551Z cpu26:19100970 opID=74f54281)[type 10c00001 offset 125706240 v 13942, hb offset 3215360
gen 263, mode 2, owner 00000000-00000000-0000-000000000000 mtime 11919512
num 1 gblnum 0 gblgen 0 gblbrk 0] alloc owner 0
YYYY-MM-DDThh:mm:ss.551Z cpu26:19100970 opID=74f54281)Fil3: 5033: Lock failed on file: .<VSAN OBJECT ID - VM HOME DIR>.lck on vol 'adf2e15e-3e0d-c0a6-fffe-9440c9228f9c' with FD: <FD c288 r4>
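The `owner(s) MAC:` field in the DLX entries identifies the host holding the lock. This Python sketch assumes the DLX line layout shown above and extracts the owner MAC addresses per lock offset, keeping only those that do not belong to the local host. The set of local MAC addresses is an input you must supply yourself (for example, from the host's network configuration), and the function name is hypothetical.

```python
import re

# DLX entry listing lock owners, e.g.
# "... DLX: 2651: vol '<vol>', lock at 125706240: Lock type: 10C00001. owner(s) MAC: aa:bb:cc:dd:ee:ff:"
DLX_OWNER = re.compile(
    r"DLX: \d+: vol '(?P<vol>[^']*)', lock at (?P<offset>\d+): "
    r"Lock type: (?P<type>[0-9A-Fa-f]+)\. owner\(s\) MAC: (?P<macs>.*)$"
)
MAC = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")


def foreign_lock_owners(lines, local_macs):
    """Return {lock offset: sorted owner MACs not in local_macs} from vmkernel log lines."""
    local = {mac.lower() for mac in local_macs}
    foreign = {}
    for line in lines:
        match = DLX_OWNER.search(line)
        if not match:
            continue
        owners = {mac.lower() for mac in MAC.findall(match.group("macs"))}
        others = owners - local
        if others:
            foreign.setdefault(int(match.group("offset")), set()).update(others)
    return {offset: sorted(macs) for offset, macs in foreign.items()}
```

An empty result means every owner reported in the log is the local host, which matches the same-host case described below.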
Identify the locks held on the files by following the KB article for the datastore type:
VMFS: https://knowledge.broadcom.com/external/article/314365/investigating-virtual-machine-file-locks.html
vSAN: https://knowledge.broadcom.com/external/article/326800/investigating-virtual-disk-file-locks-on.html
If this process shows that the locks are held by two different hosts, log in to the ESXi host that is not running the impacted VM and use the "lsof" command to identify the process holding the lock. If the lock is held by a vmx process on that host, the backup proxy VM is most likely holding it.
If the impacted VM and the backup proxy VM run on the same ESXi host, the lock output does not show the MAC address of any other host. In that case, run lsof on that host and filter the output for the snapshot file names to identify any additional locks and the process ID of the vmx process holding them.
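The lsof check can be scripted once its output is saved to a file (for example with `lsof > /tmp/lsof.txt` on the host). This Python sketch assumes that each relevant line starts with a numeric cartel (process) ID and ends with a file path under `/vmfs/`; verify this against the actual column layout on your ESXi build. It lists the IDs that have a given snapshot file open and reports any ID other than the impacted VM's own vmx cartel ID. The function names are hypothetical.

```python
from collections import defaultdict


def lock_holders(lsof_lines, disk_name):
    """Map each file path containing disk_name to the sorted cartel IDs that have it open."""
    holders = defaultdict(set)
    for line in lsof_lines:
        fields = line.split()
        # Skip the header and any line that does not start with a numeric ID (assumed layout).
        if not fields or not fields[0].isdigit():
            continue
        path_start = line.find("/vmfs/")
        if path_start == -1:
            continue
        path = line[path_start:].strip()
        if disk_name in path:
            holders[path].add(int(fields[0]))
    return {path: sorted(ids) for path, ids in holders.items()}


def extra_holders(holders, vm_cartel_id):
    """Return {path: [cartel IDs]} for files opened by anything other than the VM's own vmx."""
    extra = {}
    for path, ids in holders.items():
        others = [cartel for cartel in ids if cartel != vm_cartel_id]
        if others:
            extra[path] = others
    return extra
```

Filtering on a specific delta name such as `Test-VM-000255` keeps the output focused on the disk that fails to consolidate.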
If a file lock is held by a process ID that does not belong to the impacted VM, determine which VM owns that process ID. If it is a backup proxy VM, engage the customer to release the lock by detaching (de-provisioning) the impacted VM's vmdk from the proxy VM.
The customer must check whether any backup jobs are still active on the mounted vmdk and follow the backup vendor's guidelines to detach it. Power cycling (not rebooting) the proxy VM often releases the lock, but this is at the customer's discretion. After the power cycle completes, make sure to remove the vmdk reference from the backup proxy VM.
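To determine which VM owns a given process ID, as described above, the "VMX Cartel ID" reported by `esxcli vm process list` on the same host can be matched against the IDs found with lsof. This Python sketch assumes the usual output layout of that command: one block per running VM, starting with an unindented VM name followed by indented `Key: value` lines. Confirm the layout on your ESXi version; the function name is hypothetical.

```python
def cartel_to_vm(esxcli_output):
    """Map VMX Cartel ID -> VM display name from `esxcli vm process list` text output."""
    records = []
    current = None
    for raw in esxcli_output.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # Unindented line: start of a new VM block (the VM's name).
            current = {"_header": raw.strip()}
            records.append(current)
        elif current is not None and ":" in raw:
            # Split on the first colon only; values such as paths may contain colons.
            key, value = raw.split(":", 1)
            current[key.strip()] = value.strip()
    mapping = {}
    for record in records:
        cartel = record.get("VMX Cartel ID", "")
        if cartel.isdigit():
            mapping[int(cartel)] = record.get("Display Name", record["_header"])
    return mapping
```

Looking up an unexpected cartel ID in this mapping shows whether it belongs to a backup proxy VM.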