VM freezes for long time from backup activity on Nutanix Storage with NFSv3

Article ID: 338063

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • The VM may freeze for a long time during backup activity
  • The freeze time increases with the number of disks attached to the VM
  • For a VM with a single disk attached, the freeze lasts 30 to 40 seconds
  • In vmware.log, logging stops for about 40 seconds after the disk is opened for consolidation, with entries similar to:
2017-12-12T03:10:37.752Z| vcpu-0| I125: DISKLIB-LIB   : Opened "/vmfs/volumes/c11c2c66-45b721c5/VirtualMachine/VirtualMachine_2-000001.vmdk" (flags 0x8, type vmfsSparse).
2017-12-12T03:11:17.784Z| vcpu-0| I125: nutanix_nfs_plugin: Established VAAI session with NFS server NFS_SERVER_IP datastore: /vmfs/volumes/c11c2c66-45b721c5 remoteshare: /NFSDatastore timeout: 30 seconds crossvol: 1 snap: 1 xfer_size: -1
2017-12-12T03:11:17.785Z| vcpu-0| I125: nutanix_nfs_plugin: ExtentdedStat /vmfs/volumes/c11c2c66-45b721c5/VirtualMachine/VirtualMachine_2-flat.vmdk
  • The VM is backed by a datastore using NFSv3
  • The issue is observed with proxy-based backups where the Backup Proxy VM is hosted on a different host than the one where the affected VM is powered on

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
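
The pause described in the symptoms can be located by comparing the timestamps of consecutive vmware.log lines. The following is a minimal Python sketch, not a VMware tool: the function name `find_log_gaps`, the 30-second threshold, and the `SAMPLE` constant are illustrative assumptions, and the sample lines are shortened copies of the excerpt above.

```python
import re
from datetime import datetime

# Leading timestamp of a vmware.log line, e.g. 2017-12-12T03:10:37.752Z|
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\|")


def find_log_gaps(lines, threshold_seconds=30.0):
    """Return (gap_seconds, line_before, line_after) for each pause >= threshold."""
    gaps = []
    previous_time = None
    previous_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # skip lines that do not start with a timestamp
        current_time = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if previous_time is not None:
            gap = (current_time - previous_time).total_seconds()
            if gap >= threshold_seconds:
                gaps.append((gap, previous_line, line))
        previous_time, previous_line = current_time, line
    return gaps


# Shortened copies of the example lines shown in the excerpt above.
SAMPLE = [
    '2017-12-12T03:10:37.752Z| vcpu-0| I125: DISKLIB-LIB   : Opened',
    '2017-12-12T03:11:17.784Z| vcpu-0| I125: nutanix_nfs_plugin: Established VAAI session',
    '2017-12-12T03:11:17.785Z| vcpu-0| I125: nutanix_nfs_plugin: ExtentdedStat',
]

for gap, before, after in find_log_gaps(SAMPLE):
    print(f"{gap:.3f} s pause after: {before}")
```

To check a real log, pass the lines of the VM's vmware.log file to `find_log_gaps` instead of `SAMPLE`. A pause that directly follows a `DISKLIB-LIB : Opened` entry matches the pattern described in this article.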

Environment

VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.0

Cause

This is a limitation of the NFSv3 locking mechanism that appears when the Backup Proxy VM accesses the disks of a VM from a host other than the one where the affected VM is powered on. Instead of maintaining lock state, NFSv3 lets the previous lock expire before a new lock can be claimed.

The scenario can be explained as follows:

  1. The VM is running on Host1 and the Backup Proxy VM is running on Host2
  2. When the backup is initiated, a snapshot is created on Host1
  3. The parent disk is now shared between Host1 and Host2 in READ mode
  4. The backup is performed on Host2
  5. After the backup, a snapshot deletion request arrives on Host1
  6. This triggers consolidation of the child disk into the parent disk
  7. Host1 requires an exclusive lock (EXCL) on the parent disk for consolidation
  8. Because the parent disk holds a shared READ lock, Host1 waits until that lock expires
  9. After 30 seconds, Host1 acquires the exclusive lock and performs the consolidation
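
The wait in steps 7 to 9 can be illustrated with a toy model of a lease-based lock. This is only a sketch under stated assumptions, not the ESXi or NFSv3 implementation: the class `ToyLeaseLock`, its methods, the 10-second refresh interval, and the rule that only leases held by other hosts block the requester are all hypothetical. The 30-second lease is taken from step 9.

```python
from dataclasses import dataclass, field

LEASE_SECONDS = 30  # expiry interval from step 9; a real timeout may differ


@dataclass
class ToyLeaseLock:
    """Toy lease-based lock: a holder's claim is dropped only when its lease expires."""

    read_leases: dict = field(default_factory=dict)  # host name -> lease expiry time

    def refresh_read(self, host, now):
        """Take or refresh a shared READ lease for `host` at time `now` (seconds)."""
        self.read_leases[host] = now + LEASE_SECONDS

    def exclusive_wait(self, host, now):
        """Seconds `host` must wait at `now` before it can take the exclusive lock."""
        # Assumption: only leases held by OTHER hosts block the requester.
        blocking = [expiry for holder, expiry in self.read_leases.items()
                    if holder != host and expiry > now]
        return max(blocking, default=now) - now


parent_disk = ToyLeaseLock()
# Steps 3-4: Host2 reads the parent disk during the backup, refreshing until t=100.
for t in range(0, 101, 10):
    parent_disk.refresh_read("Host2", now=t)
# Steps 5-9: Host1 requests the exclusive lock as soon as the backup ends.
cross_host_wait = parent_disk.exclusive_wait("Host1", now=100)

# Same-host case for comparison: the only READ lease belongs to Host1 itself.
same_host = ToyLeaseLock()
same_host.refresh_read("Host1", now=100)
same_host_wait = same_host.exclusive_wait("Host1", now=100)

print(cross_host_wait, same_host_wait)  # prints "30 0"
```

In this model the cross-host request waits the full lease interval because nothing tells Host1 that Host2 has finished, while a request from the host that already holds the lease does not wait.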

The same situation may happen with a Clone operation targeting a second host:

  1. The source VM is running on Host1 and Host2 is the clone destination
  2. A snapshot (child disk) is created on Host1
  3. The parent disk is now shared between Host1 and Host2 in READ mode
  4. The copy is performed on Host2, which creates the cloned VM
  5. After the clone operation, a snapshot deletion request arrives on Host1
  6. This triggers consolidation of the child disk into the parent disk
  7. Host1 requires an exclusive lock (EXCL) on the parent disk for consolidation
  8. Because the parent disk holds a shared READ lock, Host1 waits until that lock expires
  9. After 30 seconds, Host1 acquires the exclusive lock (EXCL) and performs the consolidation

Resolution

To resolve this issue, use one of the following options:

  1. Use NFSv4.1, where locking (state management) is offloaded to the NFS server (array).
  2. Use an iSCSI or FC datastore for VC.
  3. Move the Backup Proxy VM to the same ESXi host as the VM being backed up. Note: Consult your backup vendor about configuring backups so that each VM is backed up by a proxy VM on the same ESXi host.
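
The third option amounts to matching each VM with a proxy on its own host. The sketch below shows that selection rule in Python. It is purely illustrative: the function `pick_same_host_proxy`, the inventory dictionaries, and the proxy names are hypothetical, and how a proxy is actually assigned depends on the backup product.

```python
def pick_same_host_proxy(vm_name, vm_hosts, proxy_hosts):
    """Return the name of a proxy VM on the same ESXi host as `vm_name`, or None."""
    target_host = vm_hosts[vm_name]
    # Sort so the choice is deterministic when several proxies share the host.
    candidates = sorted(proxy for proxy, host in proxy_hosts.items()
                        if host == target_host)
    return candidates[0] if candidates else None


# Hypothetical inventory: VM name -> ESXi host, and proxy VM name -> ESXi host.
vm_hosts = {"VirtualMachine": "Host1"}
proxy_hosts = {"proxy-a": "Host2", "proxy-b": "Host1"}

chosen = pick_same_host_proxy("VirtualMachine", vm_hosts, proxy_hosts)
print(chosen)  # prints "proxy-b"
```

A result of `None` means no proxy shares the VM's host, so a backup of that VM would still read the parent disk from a different host.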