This problem can occur in a vSphere HA environment in the following ways:
Scenario 1)
Host H1 is running a VM that is protected by HA.
If the host isolation response in the vSphere HA settings is set to “Disabled” and host H1 becomes network isolated, the HA primary in the other partition fails over the VM running on the isolated host to another host in the cluster. When the isolated host rejoins the network, two instances of the VM are running on different hosts in the cluster.
Scenario 2)
Virtual machine VM1 is stored on datastore DS1 and is registered and running on host H1. DS1 hits APD and, at around the same time, host H1 becomes network isolated from the other hosts in the cluster. The FDM primary in the other partition marks host H1 as Dead and fails over its VMs to other hosts in the cluster. When a datastore hits APD, FDM waits for the APD timeout (140 seconds by default) plus the VMCP timeout (180 seconds by default) before taking any action. FDM acts on the selected APD policy ONLY if the APD timeout plus the VMCP timeout has expired.
If the APD clears before the timeout expires, FDM does not act on the VMCP policy. If "vmReactionOnAPDCleared" is set to "none", FDM takes no action when the APD clears, and a split-brain scenario exists once the partition is resolved.
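The timing rule above can be sketched as a small calculation. This is only an illustration of the arithmetic described in the text, not FDM's actual logic: the function names are hypothetical, the defaults are the ones quoted above, treating the exact expiry instant as "expired" is an assumption, and the second function assumes the host was also isolated and its VM has already been failed over.

```python
# Defaults quoted in the text above; both values can differ per environment.
APD_TIMEOUT_SECS = 140
VMCP_TIMEOUT_SECS = 180


def fdm_acts_on_apd_policy(apd_duration_secs,
                           apd_timeout=APD_TIMEOUT_SECS,
                           vmcp_timeout=VMCP_TIMEOUT_SECS):
    """Return True if the APD lasted through APD timeout + VMCP timeout.

    Assumption: an APD that lasts exactly the combined window counts as expired.
    """
    return apd_duration_secs >= apd_timeout + vmcp_timeout


def duplicate_vm_expected(apd_duration_secs, reaction_on_apd_cleared):
    """Return True if the isolated host is expected to keep its VM running.

    Models only the case described in the text: the APD cleared before the
    combined window expired and vmReactionOnAPDCleared is "none".
    """
    cleared_early = not fdm_acts_on_apd_policy(apd_duration_secs)
    return cleared_early and reaction_on_apd_cleared == "none"
```

With the defaults, the combined window is 140 + 180 = 320 seconds, so an APD that clears after 100 seconds with the reaction set to "none" leaves the original VM instance running on the isolated host.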
Scenario 3)
The host isolation address remains accessible to the ESXi host, but connectivity to the HA cluster is lost for a short period. When the isolated host rejoins the network, two instances of the VM are running on different hosts in the cluster.
In each case, a VMX process for the VM is running on two different ESXi hosts, but only one of them holds the lock on the VM files. To resolve the split-brain scenario, identify the host on which the VMX process holds the lock on the VM files, then power off the VM on the other host (the one that does not hold the lock).
For VMFS, vSAN, and vVol datastores:
Method 1:
Follow the steps mentioned in this KB article: Investigating virtual machine file locks on ESXi
1. Log in to an ESXi host and run the following command:
vmkfstools -D <path-to-vmx-lck-file>
For example:
vmkfstools -D /vmfs/volumes/5df95a18-########-####-########cfd/New\ Virtual\ Machine/New\ Virtual\ Machine.vmx.lck
Lock [type 10c00001 offset 8003584 v 106, hb offset 3670016
gen 13115, mode 1, owner 5df9572c-########-####-##########81 mtime 2144440
num 0 gblnum 0 gblgen 0 gblbrk 0]
Addr <4, 0, 65>, gen 95, links 1, type reg, flags 0xa, uid 0, gid 0, mode 600
len 1073741824, nb 1024 tbz 1024, cow 0, newSinceEpoch 1024, zla 3, bs 1048576
affinityFD <4,0,62>, parentFD <4,0,62>, tbzGranularityShift 20, numLFB 0
lastSFBClusterNum 15, numPreAllocBlocks 0, numPointerBlocks 1
The output identifies the “owner” of the file. In the output above:
owner 5df9572c-########-####-##########81
The last section of 5df9572c-########-####-##########81 (02004b84b281 in this example) is the MAC address of the host that owns the file.
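The extraction described above can be sketched in a few lines. This is a minimal illustration that assumes the output has already been captured as text; the function name is hypothetical, and the zeroed middle sections of the sample owner value are placeholders for the parts masked in this article.

```python
import re


def lock_owner_mac(vmkfstools_output):
    """Return the lock owner's MAC address as aa:bb:cc:dd:ee:ff, or None.

    Looks for the "owner <uuid>" field in `vmkfstools -D` output and
    formats the last dash-separated section of the UUID as a MAC address.
    """
    match = re.search(r"\bowner\s+([0-9a-fA-F]+(?:-[0-9a-fA-F]+)+)",
                      vmkfstools_output)
    if match is None:
        return None
    last_section = match.group(1).rsplit("-", 1)[-1].lower()
    if len(last_section) != 12:
        return None  # not a 6-byte MAC address
    return ":".join(last_section[i:i + 2] for i in range(0, 12, 2))


# Sample line shaped like the output above; the zeroed sections are placeholders.
SAMPLE_LOCK_LINE = (
    "gen 13115, mode 1, owner 5df9572c-00000000-0000-02004b84b281 mtime 2144440"
)
```

Calling lock_owner_mac(SAMPLE_LOCK_LINE) gives the address in the colon-separated form that esxcfg-nics prints, which makes the comparison in the next step easier.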
The next step is to find the host that has the MAC address 02004b84b281. One way to do this is to log in to each ESXi host and run the following command:
esxcfg-nics -l
Name    PCI           Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic0  0000:0b:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:##:##:##  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic1  0000:13:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:##:##:##  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic2  0000:1b:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:##:##:##  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic3  0000:04:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:##:##:81  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
If one of the vmnics has the MAC address 02:00:4b:##:##:81 (vmnic3 in this example), this is the host that holds the locks on the VM files.
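When there are many hosts to check, the comparison can be scripted. The sketch below assumes the esxcfg-nics -l output of one host has been captured as text; the function name is hypothetical, and the MAC addresses in the sample are illustrative stand-ins for the masked values above.

```python
import re

# Six colon-separated hex pairs, as printed in the "MAC Address" column.
MAC_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b")


def nics_with_mac(esxcfg_nics_output, wanted_mac):
    """Return the names of the vmnics whose MAC address equals wanted_mac.

    wanted_mac may be given with or without colons (02:00:4b:... or 02004b...).
    """
    wanted = wanted_mac.replace(":", "").lower()
    matches = []
    for line in esxcfg_nics_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("vmnic"):
            continue  # skip the header row and blank lines
        found = MAC_RE.search(line)
        if found and found.group(0).replace(":", "").lower() == wanted:
            matches.append(fields[0])
    return matches


# Illustrative sample; the MAC addresses are made up for this example.
SAMPLE_NICS = """\
Name    PCI           Driver  Link  Speed     Duplex  MAC Address        MTU   Description
vmnic0  0000:0b:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:00:00:01  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
vmnic3  0000:04:00.0  ne1000  Up    1000Mbps  Full    02:00:4b:84:b2:81  1500  Intel Corporation Virtual 82574L Gigabit Ethernet
"""
```

A non-empty result for a host's output means that host holds the lock. An empty list means the check should be repeated on the next host.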
2. Power off the VM on the other host, which does not hold the lock
To power off the VM, do the following after logging in to that ESXi host:
a. Run the vim-cmd vmsvc/getallvms command to display the names of the virtual machines registered on this host.
b. Note the ID (“VMID”) of the affected virtual machine.
c. Power off the virtual machine with the following command:
vim-cmd vmsvc/power.off VMID
d. Run the vim-cmd vmsvc/getallvms command again and check whether the stale VM is still registered on the host.
e. If the powered-off copy of the VM is still registered on the host, unregister it with the following command:
vim-cmd vmsvc/unregister VMID
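Steps a and b can be sketched as a lookup of the VMID by VM name. This assumes the vim-cmd vmsvc/getallvms output has been captured as text and that each VM row starts with the numeric Vmid, followed by the name and a bracketed datastore in the File column. The function name, the sample rows, and the datastore name are illustrative, and a VM name containing "[" would defeat this simple pattern.

```python
import re

# Vmid, then the name (may contain spaces), then "[datastore] path/to/file.vmx".
VM_ROW_RE = re.compile(r"^(\d+)\s+(.+?)\s+\[([^\]]+)\]\s+(.+?\.vmx)\b")


def find_vmids(getallvms_output, vm_name):
    """Return the Vmids of all registered VMs whose name equals vm_name."""
    vmids = []
    for line in getallvms_output.splitlines():
        row = VM_ROW_RE.match(line.strip())
        if row and row.group(2) == vm_name:
            vmids.append(int(row.group(1)))
    return vmids


# Illustrative sample; real output depends on the host.
SAMPLE_GETALLVMS = """\
Vmid         Name                 File                                                        Guest OS    Version   Annotation
1      New Virtual Machine   [datastore1] New Virtual Machine/New Virtual Machine.vmx   otherGuest   vmx-13
7      testvm                [datastore1] testvm/testvm.vmx                             otherGuest   vmx-13
"""
```

The returned ID is the VMID to pass to vim-cmd vmsvc/power.off and, if needed, vim-cmd vmsvc/unregister on the host that does not hold the lock.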
For more information, refer to the VMware KB article: Investigating virtual machine file locks on ESXi
For NFS datastores:
To identify the host that has locked the VMX files, refer to the following VMware KB article:
Understanding the NFS .lck lock file to understand the ESX host and NFS filename it refers to
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
Workaround:
If these symptoms are seen on VMC on AWS, please contact VMware Support to address the issue.