VMs take two minutes to power on after a High Availability/Fault Domain Manager failover when NFS v3 storage is used

Article ID: 393514

Products

VMware vCenter Server

VMware vSphere ESXi 8.0

VMware vSphere ESXi 7.0

Issue/Introduction

VMs fail over successfully, but the powerOn operation takes two minutes to complete.

hostd.log on the ESXi host contains entries such as these while registering and powering on the VM:

YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 8 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 18 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 28 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 38 sec.
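
To check whether a host shows this signature, the IoTracker warnings can be pulled out of a copy of hostd.log with a short script. The following is a minimal sketch, not a supported tool: the regular expression is inferred only from the sample lines above, and the function name longest_lock_waits is hypothetical.

```python
import re

# Pattern inferred from the sample hostd.log lines above (an assumption,
# not a documented log schema): captures the .lck path and the wait time.
SLOW_OPEN = re.compile(
    r'sub=IoTracker\].*open\("(?P<path>[^"]+\.lck)"\) took over (?P<secs>\d+) sec'
)


def longest_lock_waits(lines):
    """Return a dict mapping each lock file path to its longest reported wait (seconds)."""
    waits = {}
    for line in lines:
        match = SLOW_OPEN.search(line)
        if match:
            path = match.group("path")
            secs = int(match.group("secs"))
            # Keep only the largest "took over N sec" value seen per lock file.
            waits[path] = max(secs, waits.get(path, 0))
    return waits
```

Passing the lines of a log file (for example, the result of reading a saved copy of hostd.log line by line) returns one entry per lock file, which makes it easy to see which VMs waited close to the 40-second mark.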

vmkernel.log contains the following:

YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu4:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu0:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu52:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu6:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
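
The NFSLock expiry messages can be tallied in a similar way to see how many stale locks a host had to wait out. This is a rough sketch under the assumption that the message format matches the sample lines above; the pattern and the function name count_expired_locks are illustrative, not part of any VMware tooling.

```python
import re
from collections import Counter

# Pattern inferred from the sample vmkernel.log lines above (an assumption):
# captures the lock file name that precedes the word "expired".
NFS_LOCK_EXPIRED = re.compile(r"NFSLock: \d+: lock (?P<lock>\S+) expired")


def count_expired_locks(lines):
    """Count 'lock ... expired' messages per lock file name."""
    counts = Counter()
    for line in lines:
        match = NFS_LOCK_EXPIRED.search(line)
        if match:
            counts[match.group("lock")] += 1
    return counts
```

A non-empty result around the time of the failover is consistent with the stale-lock behavior described in this article, although it does not by itself rule out other causes of a slow power-on.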

Environment

ESXi 7.0

ESXi 8.0

Cause

This behavior is caused by the file locking mechanism used with NFS v3. When a host fails, stale lock files are left behind on the NFS v3 datastore. When the new host accesses a file that is still locked, it takes 40 seconds for the lock ownership to move to that host.

The 40-second timeout is hit three times during the overall operation:

  • 40 seconds to access the vmx lock during the reconfigure task.
  • 40 seconds in vmmon power on.
  • 40 seconds while opening disks.

This totals 120 seconds (two minutes).
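
The arithmetic can be written out as a small model. This is only an illustration of the three sequential waits described in this article; the constant and function names are hypothetical, and it assumes that the phases run one after another and that each hits the full timeout.

```python
# Stale-lock timeout per phase, as stated in this article.
LOCK_TIMEOUT_SECONDS = 40

# The three phases that each wait on a stale lock, in the order listed above.
PHASES = [
    "vmx lock access during the reconfigure task",
    "vmmon power on",
    "opening disks",
]


def expected_power_on_delay(timeout_seconds=LOCK_TIMEOUT_SECONDS, phases=PHASES):
    """Return the total delay in seconds, assuming each phase waits the full timeout in sequence."""
    return timeout_seconds * len(phases)
```

With the default values, the function returns 120, matching the two-minute delay observed during the powerOn operation.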

Resolution

This behavior is by design and is a limitation of the locking mechanism in NFS v3. NFS v4.1 uses a different locking mechanism, so an improvement in failover timing would be expected there. For further information, contact your NFS storage vendor.