VMs fail over successfully, but the powerOn operation takes two minutes to complete.
hostd.log on the ESXi host contains entries such as these while registering and powering on the VM:
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 8 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 18 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 28 sec.
YYYY-MM-DDTHH:mm:ss Wa(164) Hostd[xxxxxx]: [Originator@6876 sub=IoTracker] In thread xxxxxx, open("/vmfs/volumes/xxxxxxxx/xxxxxxxx/xxxxxxxx.vmx.lck") took over 38 sec.
vmkernel.log contains the following:
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu4:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu0:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu52:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
YYYY-MM-DDTHH:mm:ss In(182) vmkernel: cpu6:xxxxxx)NFSLock: 3387: lock .lck-xxxxxxxxxxxx expired: counter prev xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx : curr xxxxxx xxxxxx-xxxxxx-xxxx-xxxxxxxx (loop count 3)
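To check whether a host shows this pattern, the two logs can be scanned for the messages above. The following is a minimal sketch, not a supported tool: it assumes the log lines look like the samples shown here, and the function name `summarize` and both regular expressions are illustrative choices rather than anything defined by ESXi.

```python
import re

# Patterns derived from the sample entries above (assumption: real log
# lines follow the same layout; adjust the expressions if they do not).
SLOW_OPEN = re.compile(
    r'sub=IoTracker\].*open\("(?P<path>[^"]+\.lck)"\) took over (?P<secs>\d+) sec'
)
LOCK_EXPIRED = re.compile(r'NFSLock: \d+: lock (?P<lock>\S+) expired')


def summarize(hostd_lines, vmkernel_lines):
    """Return (longest reported wait in seconds per .lck path,
    number of NFSLock 'expired' messages)."""
    slowest = {}
    for line in hostd_lines:
        match = SLOW_OPEN.search(line)
        if match:
            path = match.group("path")
            secs = int(match.group("secs"))
            # Keep only the largest "took over N sec" value seen per file.
            slowest[path] = max(secs, slowest.get(path, 0))
    expired = sum(1 for line in vmkernel_lines if LOCK_EXPIRED.search(line))
    return slowest, expired
```

Passing the lines of copies of hostd.log and vmkernel.log to `summarize` returns the longest wait reported for each lock file and the number of expired-lock messages. Waits approaching 40 seconds on `.lck` files, together with NFSLock expiry messages, are consistent with the behavior described in this article.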
ESXi 7.0
ESXi 8.0
This is caused by the file locking mechanism used by NFS v3. When a host fails, stale lock files are left behind on the NFS v3 datastore. When a file with a stale lock is accessed, it takes 40 seconds for lock ownership to move to the new host.
In total, the 40-second timeout is hit three times during the overall operation, which adds up to 120 seconds (two minutes).
This is by design and is a limitation of the locking mechanism in NFS v3. NFS v4.1 uses a different locking mechanism, so an improvement in failover timing would be expected there. For further information, please contact your NFS storage vendor.