Handling Split Brain scenario in vSphere

Article ID: 313940

Products

VMware Cloud on AWS, VMware vSphere ESX 6.x, VMware vSphere ESX 7.x

Issue/Introduction

This article explains how to identify and kill the VMX process that has lost control of the VM.


Symptoms:
HA-enabled vCenter clusters may end up with multiple instances of a VM running on different ESXi hosts. In the vCenter UI, the host shown as owning the VM may flap between the hosts on which the instances are running.

VMs sharing host resources with the split-brain VMs may experience a "DRS storm", where many VMs rapidly migrate between the hosts running the split-brain VMs. This can degrade VM performance during the migrations and can lead to resource overcommitment on the affected hosts.


Cause

We can encounter this problem in a vSphere HA environment in the following ways:

Let’s say a host H1 has a VM running on it which is protected by HA.
 

Scenario 1)

If the host isolation response in the vSphere HA settings is set to "Disabled" and host H1 becomes network isolated, the HA primary in the other partition will fail over the VM running on the isolated host to other hosts in the cluster. When the isolated host rejoins the network, there will be two instances of the VM running on different hosts in the cluster.

Scenario 2)

Let's say virtual machine VM1 is stored on datastore DS1 and is registered and running on host H1. If DS1 hits an All Paths Down (APD) condition and host H1 becomes network isolated from the other hosts in the cluster at around the same time, the FDM primary in the other partition will mark host H1 as Dead and fail over the VMs to other hosts in the cluster. When a datastore hits APD, FDM waits for the APD timeout (140 seconds by default) plus the VMCP timeout (180 seconds by default) before taking any action. FDM acts on the selected APD policy only after the APD timeout plus the VMCP timeout has expired.

If the APD clears before the timeout expires, FDM will not act on the VMCP policy. If "vmReactionOnAPDCleared" is set to "none", FDM takes no action when the APD clears, and a split-brain scenario results when the network partition is resolved.
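
The 140-second APD timeout referenced above corresponds to the Misc.APDTimeout advanced setting on each ESXi host. As a quick, hedged check (assuming a default installation; the VMCP timeout itself is configured in the cluster's vSphere HA settings), you can confirm the value from the ESXi shell:

esxcli system settings advanced list -o /Misc/APDTimeout   # "Int Value" shows the current timeout, 140 by default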

Scenario 3)

The host isolation address remains accessible to the ESXi host; however, connectivity to the HA cluster is lost for a short period. When the isolated host rejoins the network, there will be two instances of the VM running on different hosts in the cluster.
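
As a hedged illustration of checking whether the isolation address (which defaults to the management network gateway) is reachable, you can run vmkping from the ESXi shell; the vmkernel interface name and the address below are placeholders for your environment:

vmkping -I vmk0 192.0.2.1   # vmk0 = management vmkernel interface, 192.0.2.1 = isolation address (placeholder)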

Resolution

There will be a VMX process for the VM running on two different ESXi hosts, but only one of them holds the lock on the VM's files. To resolve the split-brain scenario, we need to identify the host on which the VMX process is holding the lock on the VM files and power off the VM on the other host (the one that does not hold the lock).

1. Identify the host which has locked the VMX files

For VMFS, VSAN, VVOL datastores:

Method 1:

Follow the steps mentioned in this KB article: Investigating virtual machine file locks on ESXi

Method 2:

1. Log in to an ESXi host.
2. Run the command "vmkfstools -D <path-to-vmx-lck file>", for example:

vmkfstools -D /vmfs/volumes/5df95a18-########-####-########cfd/New\ Virtual\ Machine/New\ Virtual\ Machine.vmx.lck

Lock [type 10c00001 offset 8003584 v 106, hb offset 3670016 
gen 13115, mode 1, owner 5df9572c-########-####-##########81 mtime 2144440 
num 0 gblnum 0 gblgen 0 gblbrk 0] 
Addr <4, 0, 65>, gen 95, links 1, type reg, flags 0xa, uid 0, gid 0, mode 600 
len 1073741824, nb 1024 tbz 1024, cow 0, newSinceEpoch 1024, zla 3, bs 1048576 
affinityFD <4,0,62>, parentFD <4,0,62>, tbzGranularityShift 20, numLFB 0 
lastSFBClusterNum 15, numPreAllocBlocks 0, numPointerBlocks 1 



The output contains information about the "owner" of the file. From the above output:

owner 5df9572c-########-####-##########81

The last section of 5df9572c-########-####-##########81 (which is 02004b84b281) is the MAC address of the host that owns the file.
The next step is to find the host that has the MAC address 02004b84b281. One way of doing this is to log in to each ESXi host and run the following command:

 esxcfg-nics -l 

Name PCI Driver Link Speed Duplex MAC Address MTU Description 
vmnic0 0000:0b:00.0 ne1000 Up 1000Mbps Full 02:00:4b:##:##:## 1500 Intel Corporation Virtual 82574L Gigabit Ethernet 
vmnic1 0000:13:00.0 ne1000 Up 1000Mbps Full 02:00:4b:##:##:## 1500 Intel Corporation Virtual 82574L Gigabit Ethernet 
vmnic2 0000:1b:00.0 ne1000 Up 1000Mbps Full 02:00:4b:##:##:## 1500 Intel Corporation Virtual 82574L Gigabit Ethernet 
vmnic3 0000:04:00.0 ne1000 Up 1000Mbps Full 02:00:4b:##:##:81 1500 Intel Corporation Virtual 82574L Gigabit Ethernet 


If one of the vmnics has the MAC address 02004b####81, then we have found the host that holds the locks for the VM.
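
One quick way to search for that MAC across a host's NICs is shown below as a hedged sketch, using the example owner MAC 02004b84b281 from above (esxcfg-nics prints MACs colon-separated, so add colons before searching):

esxcfg-nics -l | grep -i "02:00:4b:84:b2:81"   # a match means this host owns the file lock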


2. Power off the VM from the other host, which does not hold the lock


To power off the VM, we can do the following after logging in to the ESXi host:

a. Run the vim-cmd vmsvc/getallvms command to display the names of the virtual machines registered on this host.
b. Take note of the impacted virtual machine's ID (VMID).
c. Power-off the virtual machine with the following command:

vim-cmd vmsvc/power.off VMID

d. Run the vim-cmd vmsvc/getallvms command again and check whether the stale VM still exists on the host.
e. If the powered-off copy of the VM still exists on the host, unregister it with:

vim-cmd vmsvc/unregister VMID
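
A worked example of the full sequence on the host that does NOT hold the lock is sketched below; the VMID value 12 is hypothetical and must be replaced with the ID reported by getallvms:

vim-cmd vmsvc/getallvms        # note the Vmid column for the affected VM
vim-cmd vmsvc/power.off 12     # power off the stale instance
vim-cmd vmsvc/getallvms        # confirm whether the stale VM is still registered
vim-cmd vmsvc/unregister 12    # unregister it if it still appears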
For more information, refer to VMware KB:
Investigating virtual machine file locks on ESXi

 

For NFS datastores:
To identify the host which has locked the VMX files, refer to the VMware KB below:

Understanding the NFS .lck lock file to understand the ESX host and NFS filename it refers to

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.


Workaround:
If these symptoms are seen on VMware Cloud on AWS, please contact VMware support to address them.

Additional Information

Impact/Risks:
No impact