Frequent HA Failover Alarms Every 8 Minutes and Unexpected VM Reboots on NFS Datastores with Duplicate VM Errors

Article ID: 396306

Products

VMware vSphere ESXi

Issue/Introduction

This article covers environments where vSphere HA failover alarms fire approximately every 8 minutes and the FDM logs contain "Duplicate VM" detection messages. The issue primarily affects VMs on NFS datastores and causes unexpected VM reboots, especially during or after vMotion operations. Because HA remediates the same VMs repeatedly, end users may report application slowness or disruptions.

You may observe the following symptoms in your vSphere environment:

- Frequent vSphere HA failover alarms occurring approximately every 8 minutes
- Multiple VMs affected across the cluster with remediation events
- Alarms showing as "yellow" for 30 seconds to 2 minutes before turning green again
- Log entries showing "Duplicate VM" detection in FDM logs
- Virtual machines being rebooted unexpectedly following vMotion operations
- Issues predominantly affecting VMs located on NFS datastores
- End users experiencing intermittent slowness or application disruptions
- The problem being more prevalent during certain periods of the day (for example, between 12:00 UTC and 21:00 UTC)
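To confirm that alarms really recur on a roughly 8-minute cycle rather than at random, the alarm timestamps can be compared pairwise. The following is a minimal Python sketch, not a VMware tool: the function names and the 1-minute tolerance are illustrative assumptions, and it expects alarm times you have already collected (for example, from the cluster's event list) as `datetime` objects.

```python
from datetime import datetime, timedelta


def alarm_gaps_minutes(timestamps):
    """Return the gaps, in minutes, between consecutive alarm times."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]


def looks_periodic(timestamps, expected_minutes=8.0, tolerance_minutes=1.0):
    """True if every gap is within the tolerance of the expected interval.

    The 8-minute default comes from the symptom described in this article;
    the tolerance is an arbitrary illustrative choice.
    """
    gaps = alarm_gaps_minutes(timestamps)
    if not gaps:
        return False  # fewer than two alarms: nothing to compare
    return all(abs(g - expected_minutes) <= tolerance_minutes for g in gaps)


# Example with made-up timestamps spaced exactly 8 minutes apart.
start = datetime(2024, 1, 1, 12, 0)
sample = [start + timedelta(minutes=8 * i) for i in range(4)]
print(alarm_gaps_minutes(sample), looks_periodic(sample))
```

A strict `all(...)` check is used so that a single unrelated alarm breaks the pattern; relax it if your event list mixes several alarm sources.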

In the ESXi host /var/log/fdm.log file, you might see entries similar to the following:

Invoking GetLockOwnerForVmOnNfsDs on Duplicate VM locked file; path: /vmfs/volumes/[datastoreID]/[VM_Name]/[VM_Name].vmx.lck
Nfs lock file name to read lock owner details; filename /vmfs/volumes/[datastoreID]/[VM_Name]/.lck-[ID]
Duplicate VM lock owner identified; path: /vmfs/volumes/[datastoreID]/[VM_Name]/[VM_Name].vmx.lck, owner: host-[ID]
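To see which VMs and hosts are involved, the "Duplicate VM lock owner identified" lines can be extracted from a copy of fdm.log. The sketch below is illustrative only: the regular expression is derived solely from the sample entry above, so it assumes real log lines follow the same `path: ..., owner: ...` layout, and the function name is hypothetical.

```python
import re

# Pattern based only on the sample log entry shown in this article.
OWNER_RE = re.compile(
    r"Duplicate VM lock owner identified; path: (?P<path>.+?), owner: (?P<owner>\S+)"
)


def find_duplicate_vm_owners(lines):
    """Return (lock_file_path, owner) pairs for each matching log line."""
    hits = []
    for line in lines:
        match = OWNER_RE.search(line)
        if match:
            hits.append((match.group("path"), match.group("owner")))
    return hits


# Example using a line shaped like the sample above (values are made up).
sample = [
    "Duplicate VM lock owner identified; "
    "path: /vmfs/volumes/ds1/vm1/vm1.vmx.lck, owner: host-42",
    "unrelated log line",
]
print(find_duplicate_vm_owners(sample))
```

To run it against a real log, pass an open file object, for example `find_duplicate_vm_owners(open("fdm.log"))` on a copy of the host's /var/log/fdm.log.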

Environment

  • VMware vSphere ESXi
  • VMware vCenter Server

Cause

This issue is caused by a known bug in vSphere HA's split-brain detection mechanism for NFS datastores. When a VM migrates between hosts on NFS storage, there is a brief window in which lock ownership appears ambiguous to HA. HA then wrongly concludes that the VM is running on two hosts at once (a "split-brain" condition) and unnecessarily terminates one instance to "protect" the environment.

NFS datastores are affected because they handle VM lock files differently from VMFS and vSAN datastores. During vMotion operations, vSphere HA may incorrectly identify a VM as duplicated across hosts and trigger unnecessary failover remediation.

Resolution

For detailed resolution steps, refer to the KB article "Virtual Machine powered-off after vMotion and HA Failover".

The resolution adds an advanced setting that disables duplicate VM detection in vSphere HA, `das.config.fdm.enableDupVmDetection = false`, and then cycles vSphere HA off and on. The complete step-by-step instructions are in the KB article referenced above.

Warning: With this setting turned off, vSphere HA no longer detects duplicate VMs, so genuine "split-brain" conditions can develop undetected.

Important Notes:

  • After you apply this setting, future host upgrades may require you to disable and re-enable HA twice for the setting to take effect.
  • This setting is specifically designed to address the false duplicate VM detection issue on NFS datastores.
  • The setting will not affect other aspects of vSphere HA functionality.
  • While this issue may manifest with specific symptoms during certain time periods, the recommended solution addresses the root cause of the problem.
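For administrators who script cluster changes, the setting and the HA off/on cycle described above could be sketched with the third-party pyVmomi SDK as follows. This is an assumption-laden illustration, not the procedure from the referenced KB article: it assumes an existing vCenter session, a `cluster` object of type `vim.ClusterComputeResource` that you have already looked up, and a caller-supplied `wait_for_task` helper. The function names `build_specs` and `apply_workaround` are hypothetical.

```python
# Sketch only. Assumes the third-party pyVmomi package and an existing
# vCenter session; the step order mirrors this article's summary.
OPTION_KEY = "das.config.fdm.enableDupVmDetection"  # setting named in this article
OPTION_VALUE = "false"


def build_specs(vim):
    """Return three cluster specs: add the option, disable HA, enable HA.

    `vim` is the module obtained with `from pyVmomi import vim`; it is passed
    in so this file loads even where pyVmomi is not installed.
    """
    option = vim.option.OptionValue(key=OPTION_KEY, value=OPTION_VALUE)
    add_option = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(option=[option])
    )
    ha_off = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(enabled=False)
    )
    ha_on = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(enabled=True)
    )
    return [add_option, ha_off, ha_on]


def apply_workaround(cluster, vim, wait_for_task):
    """Apply each spec in order, waiting for every task before the next one.

    `wait_for_task` is a caller-supplied callable that blocks until the
    given task finishes.
    """
    for spec in build_specs(vim):
        # modify=True applies the spec as an incremental change.
        task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
        wait_for_task(task)
```

One candidate for `wait_for_task` is `WaitForTask` from the `pyVim.task` module shipped with pyVmomi. Test any such script on a non-production cluster first, and keep the warning above in mind before disabling duplicate VM detection.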

Additional Information

This issue is distinct from actual host failures and specifically relates to the HA duplicate VM detection mechanism.