Virtual machines residing on NFSv3 storage become unresponsive during a snapshot removal operation when CBT is enabled


Article ID: 428200


Products

VMware vSphere ESXi

Issue/Introduction

  • When NFSv3 datastores host virtual machines that are backed up by a VADP-enabled backup solution using CBT (Changed Block Tracking), you may experience these symptoms:
    • While snapshots are being removed after a backup of a virtual machine residing on an NFSv3 datastore, the virtual machine becomes unresponsive for approximately 30 seconds.
    • This issue occurs when the target virtual machine's disk was hot-added for the backup/restore operation.
    • This issue may also occur with NBD mode if a vMotion of the VM happens while its backup job is in progress. DRS can trigger such a vMotion.

Note: If the issue is observed while CBT is not enabled, refer to https://knowledge.broadcom.com/external/article?legacyId=2010953
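To decide whether this article applies, it helps to confirm that CBT is actually enabled for the VM. One quick check is to look for `ctkEnabled` entries in the VM's .vmx configuration file. Below is a minimal sketch; the sample .vmx content is illustrative, and in practice you would read the real file, e.g. /vmfs/volumes/<datastore name>/<vm directory name>/<vm name>.vmx:

```python
# Sketch: report CBT (ctkEnabled) settings found in .vmx file text.
# The sample content below is illustrative only.

def cbt_settings(vmx_text: str) -> dict:
    """Return {key: value} for every ctkEnabled entry in a .vmx file."""
    settings = {}
    for line in vmx_text.splitlines():
        if "ctkEnabled" in line and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip().strip('"')
    return settings

sample_vmx = '''\
ctkEnabled = "TRUE"
scsi0:0.ctkEnabled = "TRUE"
'''

print(cbt_settings(sample_vmx))
# {'ctkEnabled': 'TRUE', 'scsi0:0.ctkEnabled': 'TRUE'}
```

If no `ctkEnabled = "TRUE"` entries are present, CBT is not enabled and the article linked in the note above is the better starting point.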

Environment

VMware vSphere ESXi

Cause

  • This issue occurs when the target virtual machine and the CBT-enabled backup appliance reside on two different hosts, and the NFSv3 protocol is used to mount the NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up.
  • The pause is recorded as a long stun time in /vmfs/volumes/<datastore name>/vmware.log. In the example below, the VM was stopped/stunned for about 40 seconds:

In(05) vcpu-0 - Checkpoint_Unstun: vm stopped for 40116333 us
In(05) vcpu-0 - CPT: vm was stunned for 40141865 us
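Stun durations are logged in microseconds, so long pauses are easy to overlook when scanning a log by eye. The sketch below, using the two sample lines from the excerpt above, flags any stun or stop event longer than a chosen threshold:

```python
import re

# Sketch: flag stun/stop events longer than a threshold in vmware.log
# lines. The sample lines are taken from the log excerpt above.

STUN_RE = re.compile(r"vm (?:stopped|was stunned) for (\d+) us")

def long_stuns(log_lines, threshold_s=5.0):
    """Yield (line, seconds) for stun events longer than threshold_s."""
    for line in log_lines:
        match = STUN_RE.search(line)
        if match:
            seconds = int(match.group(1)) / 1_000_000
            if seconds > threshold_s:
                yield line, round(seconds, 1)

sample = [
    "In(05) vcpu-0 - Checkpoint_Unstun: vm stopped for 40116333 us",
    "In(05) vcpu-0 - CPT: vm was stunned for 40141865 us",
]
for line, secs in long_stuns(sample):
    print(f"{secs} s: {line}")
# 40.1 s: In(05) vcpu-0 - Checkpoint_Unstun: vm stopped for 40116333 us
# 40.1 s: In(05) vcpu-0 - CPT: vm was stunned for 40141865 us
```

In practice you would feed the script the contents of the VM's vmware.log; the 5-second default threshold is an arbitrary illustrative choice.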

  • The delay occurs during the CBT flush of the CTK (Change Tracking) file, which is part of the snapshot deletion task.
  • The following example from /vmfs/volumes/<datastore name>/vmware.log shows 40 seconds elapsing between the flush operation and the 'ChangeTracker_EndCombine' rename message:

2026-02-04T12:53:50.225Z In(05) vcpu-0 - DISKLIB-CTK   : Forcing flush of change info for "/vmfs/volumes/<volume uuid>/<vm directory name>/vm name-ctk.vmdk".
2026-02-04T12:53:50.225Z In(05) vcpu-0 - DISKLIB-CTK   : ChangeTracker_EndCombine()
2026-02-04T12:54:30.257Z In(05) vcpu-0 - DISKLIB-CTK   : resuming /vmfs/volumes/<volume uuid>/<vm directory name>/vm name-ctk.vmdk"
2026-02-04T12:54:30.258Z In(05) vcpu-0 - DISKLIB-CTK   : ChangeTracker_EndCombine: Renaming /vmfs/volumes/<volume uuid>/<vm directory name>/vm name-ctk.vmdk"
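The stall duration can be confirmed directly from the log timestamps. The sketch below parses the ISO-8601 timestamps of the "Forcing flush" and "resuming" lines (shortened here to mirror the excerpt above) and reports the gap:

```python
from datetime import datetime

# Sketch: measure the gap between the CTK "Forcing flush" line and the
# "resuming" line using the timestamps in vmware.log. The sample lines
# mirror the log excerpt above, with the paths shortened.

def log_time(line: str) -> datetime:
    """Parse the leading 2026-02-04T12:53:50.225Z-style timestamp."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%fZ")

flush_line = "2026-02-04T12:53:50.225Z In(05) vcpu-0 - DISKLIB-CTK   : Forcing flush of change info"
resume_line = "2026-02-04T12:54:30.257Z In(05) vcpu-0 - DISKLIB-CTK   : resuming"

elapsed = (log_time(resume_line) - log_time(flush_line)).total_seconds()
print(f"CTK flush stalled for {elapsed:.1f} s")
# CTK flush stalled for 40.0 s
```

A gap of tens of seconds between these two lines, as here, is the signature of the NFSv3 lock timeout described in this article.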

Resolution

  • This is a known limitation of the locking mechanism on NFSv3 datastores; there is currently no fix for this issue.
  • Workarounds
    • Ensure that the source virtual machine resides in the same ESXi host where the backup appliance is running.
    • Use LAN/NBD transport, which uses NFC (Network File Copy), in your backup solution, or disable SCSI hot-add through the backup software. For more information about NBD transport, see the Virtual Disk Transport Methods section in the Virtual Disk Development Kit (VDDK) Programming Guide.
  • The workarounds above mitigate the issue in scenarios where no migration takes place while the backup is in progress. The following is the only workaround if a vMotion of the VM coincides with its backup.

Note: Before implementing, refer to NFS Protocols and vSphere Solutions - NFS Datastore Concepts and Operations in vSphere Environment