Virtual machines on NFSv3 datastores might fail due to a failed snapshot consolidation during backup.

Article ID: 402937


Updated On:

Products

VMware vSphere ESXi
VMware vSphere ESXi 8.0

Issue/Introduction

This failure is seen when a snapshot consolidation is requested while the third-party backup software is still actively holding one or more of the virtual machine's disks open.

You see events similar to the following recorded in the Host Management Agent log (/var/run/log/hostd.log):

Hostd[<WORLD_ID>]: [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/<NFS_VOLUME_UUID>/<VM_HOME_FOLDER>/<VM_NAME>.vmx] Handling vmx message nnnnnn: Locking conflict for file "/vmfs/volumes/<NFS_VOLUME_UUID>/<VM_HOME_FOLDER>/<VM_NAME>-000001-delta.vmdk". Kernel open flags are 0x8. Owner process on this host is world ID nnnnnn with world name vmx-vcpu-0:VM_NAME.
Hostd[<WORLD_ID>]: --> Failed to lock the file
Hostd[<WORLD_ID>]: --> Cannot open the disk '/vmfs/volumes/<NFS_VOLUME_UUID>/<VM_HOME_FOLDER>/<VM_NAME>-flat.vmdk' or one of the snapshot disks it depends on.
Hostd[<WORLD_ID>]: --> An operation required the virtual machine to quiesce and the virtual machine was unable to continue running.
Hostd[<WORLD_ID>]: [Originator@6876 sub=Vimsvc.ha-eventmgr] Event nnnn : Error message on VM_NAME on ESX_SERVER_FQDN in ha-datacenter: An operation required the virtual machine to quiesce and the virtual machine was unable to continue running.
Hostd[<WORLD_ID>] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event nnnn : Virtual machine VM_NAME disks consolidation failed on ESX_SERVER_FQDN in cluster CLUSTER_NAME in ha-datacenter.
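To confirm this signature on an affected host, you can search the hostd log for the locking conflict and failed consolidation events shown above. This is only a quick check from the ESXi Shell; older entries may have rotated into compressed copies of hostd.log in /var/run/log.

# Search the hostd log for the locking conflict and failed consolidation events
grep -iE "Locking conflict|consolidation failed" /var/run/log/hostd.log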

 

If the VM was restarted by HA, the following entries can be observed in /var/run/log/fdm.log:


[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107401] [Originator@6876 sub=Invt opID=WorkQueue-######] Vm /vmfs/volumes/###############/######/######.vmx changed  guestHB=red
[YYYY-MM-DDTHH:MM:SS] In(166) Fdm[2107406] [Originator@6876 sub=Invt opID=WorkQueue-######] Vm /vmfs/volumes/###############/######/######.vmx curPwrState=powered on curPowerOnCount=1 newPwrState=powered off clnPwrOff=false hostReporting=__localhost__
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107406] [Originator@6876 sub=Invt opID=WorkQueue-######] Vm /vmfs/volumes/###############/######/######.vmx localhost: local power state=powered off; global power state=powered off
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107406] [Originator@6876 sub=Invt opID=WorkQueue-######] vm /vmfs/volumes/###############/######/######.vmx from __localhost__ changed inventory  cleanPwrOff=0
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107406] [Originator@6876 sub=Invt opID=WorkQueue-######] Vm /vmfs/volumes/###############/######/######.vmx changed  guestHB=gray
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107403] [Originator@6876 sub=Execution opID=host-######:##:########-#] Failing over vm /vmfs/volumes/###############/######/######.vmx (isRegistered=true)
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107403] [Originator@6876 sub=Execution opID=host-######:##:########-#] Registering vm done (vmid=/vmfs/volumes/###############/######/######.vmx, hostdVmId=)
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107403] [Originator@6876 sub=Execution opID=host-######:##:########-#] Reconfiguring vm
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107567] [Originator@6876 sub=Invt opID=WorkQueue-######] GuestKernelCrashed is false for VM 38
[YYYY-MM-DDTHH:MM:SS] Db(167) Fdm[2107567] [Originator@6876 sub=Invt opID=WorkQueue-######] VM 38: Updated GuestKernelCrashed!
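Similarly, to verify whether HA attempted to fail over the affected VM, the FDM log can be searched for the failover entries shown above; <VM_NAME> is a placeholder for the VM in question.

# Look for HA failover activity for the affected VM in the FDM log
grep -i "Failing over vm" /var/run/log/fdm.log
grep -i "<VM_NAME>.vmx" /var/run/log/fdm.log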

Environment

 

 

Cause

  • Backup software requests snapshot removal/consolidation while it still holds read-only access to the parent of the delta disk the VM is currently writing to.
  • Because the parent of the delta disk is still open, consolidation fails, as exclusive access is not possible.
    • The parent disk stays read-only while a child delta is open.
  • During this failure, the read-only lock on the parent disk, held for the NFC open from the backup software, inadvertently gets upgraded, causing any further opens to fail.
  • Until the backup proxy closes the parent disk, the VM fails to power on (a diagnostic check is sketched after this list).
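As a diagnostic, you can check from the ESXi Shell whether the VM's files are still held open on the host. This is only a sketch, not an official procedure; the lsof output format varies between ESXi releases, and the hostd.log locking-conflict message shown earlier already names the owning world ID directly.

# List open file handles on the host and filter for the VM's files
lsof | grep -i "<VM_NAME>"
# On NFSv3 datastores, ESXi also tracks locks with hidden .lck-* files in the VM folder
ls -la /vmfs/volumes/<NFS_VOLUME_UUID>/<VM_HOME_FOLDER>/ | grep -i lck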

Resolution

Powering on the Failed VM

To power on the VM, you need to have the locks on the parent disk released by force-closing the open handle. This can be done by either:

  • Stopping or forcefully killing the pending backup jobs for the VM on the backup proxy/server.

            Or

  • Rebooting the ESXi server where the VM was running at the time of failure, after migrating the other running VMs (a command-line sketch follows this list).
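If you take the reboot route, a minimal sequence from the ESXi Shell could look like the following, assuming the remaining VMs can be migrated or powered off so the host can enter maintenance mode:

# Place the host in maintenance mode (migrate or power off the remaining VMs first)
esxcli system maintenanceMode set --enable true
# Reboot the host to release the stale locks on the parent disk
esxcli system shutdown reboot --reason "Release stale NFS locks after failed snapshot consolidation"
# Once the host is back up, take it out of maintenance mode
esxcli system maintenanceMode set --enable false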

Permanent Fix: To avoid this VM failure, patch the ESXi hosts to ESXi 8.0 Update 3e, as the issue is fixed in that release.
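To confirm whether a host is already on a build that contains the fix, check the version and update level from the ESXi Shell, for example:

# Display the ESXi version, build, and update level; 8.0 Update 3e or later contains the fix
vmware -vl
esxcli system version get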

Workaround:
There are per-VM and per-host workarounds available to disable the NFS lock-upgrading functionality that causes this issue.

Per VM:
The workaround is to disable the NFS lock-upgrading functionality by setting the VM configuration option "consolidate.upgradeNFS3Locks" to "FALSE", i.e. set the following in the VM configuration file:

consolidate.upgradeNFS3Locks = "FALSE"

This requires powering off the VM, setting the option, and powering the VM back on. Follow Tips for editing a .vmx file to update the VM configuration file.
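As an illustration only, the change could be made from the ESXi Shell along the following lines, where <VMID> is the VM ID reported by vim-cmd and the paths are placeholders:

# Find the VM ID on this host
vim-cmd vmsvc/getallvms | grep "<VM_NAME>"
# Power off the VM before editing its configuration file
vim-cmd vmsvc/power.off <VMID>
# Append the workaround setting to the VM's .vmx file
echo 'consolidate.upgradeNFS3Locks = "FALSE"' >> /vmfs/volumes/<NFS_VOLUME_UUID>/<VM_HOME_FOLDER>/<VM_NAME>.vmx
# Reload the configuration so hostd picks up the change, then power the VM back on
vim-cmd vmsvc/reload <VMID>
vim-cmd vmsvc/power.on <VMID>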

Per Host:
If the host-wide configuration is set while the host is in maintenance mode, VMs will pick up the setting automatically as they are migrated back to the host. The recommended steps are:
- Put host in maintenance mode.
- SSH into the host and edit /etc/vmware/config to add the following line:
consolidate.upgradeNFS3Locks = "FALSE"
- Exit maintenance mode. From this point, as VMs migrate back, they will pick up the host-level configuration (a command-line sketch of these steps follows).
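A minimal sketch of the same host-level steps from the ESXi Shell, assuming the host can be evacuated:

# Enter maintenance mode (migrate running VMs off the host first)
esxcli system maintenanceMode set --enable true
# Add the workaround setting to the host-wide VMX configuration
echo 'consolidate.upgradeNFS3Locks = "FALSE"' >> /etc/vmware/config
# Exit maintenance mode; VMs that migrate back will pick up the host-level setting
esxcli system maintenanceMode set --enable false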