Virtual machines appear to be running or registered on multiple ESX/ESXi Servers in vSAN Stretched Cluster

Article ID: 319918

Updated On: 10-03-2024

Products

VMware vSAN

Issue/Introduction

This article explains how to detect virtual machines that appear to be running on both sites of a vSAN Stretched Cluster after a site failure, and how to terminate the stale instances on the original host.

Symptoms:
After a site failure in a vSAN Stretched Cluster, virtual machines appear to be running and registered on both sites. In vCenter Server, each affected VM appears on one host for a few seconds and then on the other host. The VMs seem to jump from host to host because of conflicts with the IP addresses still held by the instances on the original host.

This issue occurs when a vSAN Stretched Cluster encounters a site failure, and vSphere HA powers on all running virtual machines at the other site.
The virtual machines at the original site lose access to their disks on the vSAN datastore, at which point vSAN should terminate them.

In rare cases, the ESXi host's memory is over-subscribed and vSAN may fail to terminate all VMs running at the original site.

Environment

VMware vSAN 6.7.x

Cause

This issue occurs when vSAN fails to terminate running VMs that have lost disk accessibility in one site, while new instances of the VMs have already been powered on at the other site.

Resolution

Run the following commands on each ESXi host when a large number of VMs are affected by this issue in a vSAN Stretched Cluster, or when vCenter Server runs as a VM on the vSAN datastore and cannot be reached because of the IP address conflict described above:
  1. Connect to the host over SSH and log in as root.
  2. Run the esxcli vm process list command to get a list of the virtual machines running on the host:
Example:
$ esxcli vm process list
vm1
   World ID: 1001723832
   Process ID: 0
   VMX Cartel ID: 1001723827
   UUID: 42 29 08 44 ## ## ## ##-## ## ## ## ee 4a 66 bf
   Display Name: vm1
   Config File: /vmfs/volumes/vsan:527b71e8########-######3d219c68b8/########-####-####-####-########20d4/vm1.vmx
  3. Run the esxcli network nic list command to list the MAC addresses of the physical network interfaces on this host:
Example:
$ esxcli network nic list
Name PCI Device Driver Admin Status Link Status Speed Duplex MAC Address MTU Description
------ ------------ -------- ------------ ----------- ----- ------ ----------------- ---- -----------------------------------------------
vmnic0 0000:0b:00.0 nvmxnet3 Up Up 10000 Full ##:##:##:##:##:9e 1500 VMware Inc. vmxnet3 Virtual Ethernet Controller
vmnic1 0000:13:00.0 nvmxnet3 Up Up 10000 Full ##:##:##:##:##:4d 1500 VMware Inc. vmxnet3 Virtual Ethernet Controller
vmnic2 0000:1b:00.0 nvmxnet3 Up Up 10000 Full ##:##:##:##:##:15 1500 VMware Inc. vmxnet3 Virtual Ethernet Controller
vmnic3 0000:04:00.0 nvmxnet3 Up Up 10000 Full ##:##:##:##:##:db 1500 VMware Inc. vmxnet3 Virtual Ethernet Controller
  4. Run the vmfsfilelockinfo command for each VM, passing its VMX file path, to find out which MAC address owns the lock:
Example:
$ /bin/vmfsfilelockinfo -p /vmfs/volumes/vsan:527b71e8########-######3d219c68b8/########-####-####-####-########20d4/vm1.vmx
vmfsfilelockinfo Version 2.0
Looking for lock owners on "vm1.vmx"
"vm1.vmx" is locked in Exclusive mode by host having mac address ['##:##:##:##:##:15']
Please configure ESXi firewall to connect to Virtual Center
Total time taken : 1.0551715530455112 seconds.

Note: If the MAC address belongs to the local host, the running virtual machine still holds its lock. Otherwise, the virtual machine has lost the lock and it is safe to terminate it. vSphere HA may have already started the virtual machine on another host; if it has not, HA will try to restart the virtual machine after it is terminated.
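
The decision described in the note can be sketched as two small shell helpers. This is only an illustration, not part of the original procedure: the function names lock_owner_macs and is_local_mac are hypothetical, and the sketch assumes the vmfsfilelockinfo output has the same shape as the example in step 4.

```shell
# lock_owner_macs (hypothetical helper): read vmfsfilelockinfo output on stdin
# and print the MAC address(es) listed in the "... mac address [...]" line.
lock_owner_macs() {
   grep "mac address" | grep -o "\[.*\]" | grep -o "[0-9a-f][0-9a-f:]*"
}

# is_local_mac MAC NIC_LIST_FILE (hypothetical helper): succeed only if MAC is
# non-empty and appears in the saved output of `esxcli network nic list`,
# which means this host still holds the lock.
is_local_mac() {
   [ -n "$1" ] && grep -qi -- "$1" "$2"
}

# Illustrative use on an ESXi host (the VMX path is a placeholder):
#   esxcli network nic list > /tmp/mac.list
#   mac=$(/bin/vmfsfilelockinfo -p /path/to/vm.vmx | lock_owner_macs)
#   is_local_mac "$mac" /tmp/mac.list && echo "lock held locally" || echo "lock lost"
```

An empty MAC is treated as "not local" so that a file with no reported lock owner is never mistaken for a lock held by this host.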
 
You may run the following shell script on the ESXi host to automate the steps above:
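#!/bin/sh
# Save this host's physical NIC list (with MAC addresses) and its running VMs.
esxcli network nic list > /tmp/mac.list
esxcli vm process list > /tmp/vm.list

while IFS= read -r line; do
   if [ -z "$line" ]; then
      continue   # skip the blank lines between VM entries
   elif ! echo "$line" | grep ":" > /dev/null; then
      echo "Checking VM: $line"
   elif echo "$line" | grep "World ID:" > /dev/null; then
      VM_WLD_ID=$(echo "$line" | grep -o "[0-9][0-9]*")
   elif echo "$line" | grep "Config File:" > /dev/null; then
      VMX_FILE=$(echo "$line" | grep -o "/vmfs/.*")
      # Find the MAC address of the host that holds the lock on the VMX file.
      # stdin is redirected so the command cannot consume the loop's input.
      LOCKING_MAC=$(/bin/vmfsfilelockinfo -p "$VMX_FILE" < /dev/null | grep "mac address")
      STRIP_MAC=$(echo "$LOCKING_MAC" | grep -o "\[.*\]" | grep -o "[0-9a-f:][0-9a-f:]*")
      if [ -z "$STRIP_MAC" ]; then
         echo " Warning: could not determine the lock owner. Check this VM manually."
      elif ! grep "$STRIP_MAC" /tmp/mac.list > /dev/null; then
         echo " Error: VM does not hold the lock. You may run this command to terminate the VM:"
         echo " esxcli vm process kill --type=hard --world-id=$VM_WLD_ID"
      fi
   fi
done < /tmp/vm.list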

Example:
$ ./lost_lock_vm.sh
Checking VM: vm1
 Error: VM does not hold the lock. You may run this command to terminate the VM:
 esxcli vm process kill --type=hard --world-id=1001723832
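
After running the suggested kill command, it can be useful to confirm that the stale instance is gone from the host. The helper below is a minimal sketch under that assumption. The name world_is_running is hypothetical, and it only inspects the text format shown in the esxcli vm process list example above.

```shell
# world_is_running WORLD_ID (hypothetical helper): read the output of
# `esxcli vm process list` on stdin and succeed if a VM with exactly that
# World ID is still listed.
world_is_running() {
   grep -Eq "^ *World ID: *$1 *\$"
}

# Illustrative use on an ESXi host:
#   if esxcli vm process list | world_is_running 1001723832; then
#      echo "VM world 1001723832 is still running on this host"
#   else
#      echo "VM world 1001723832 is no longer running on this host"
#   fi
```

The pattern is anchored at both ends so that a World ID is not matched by a longer ID that merely contains it, and so that the VMX Cartel ID line is ignored.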