vSphere HA restarted the VM

Article ID: 393668

Products

VMware vCenter Server

Issue/Introduction

  • A running virtual machine suddenly crashed and was automatically restarted by vSphere HA (High Availability).

  • No datastore disconnections were observed.
  • No hosts crashed or entered a "Not Responding" state.

  • Further investigation showed that the host lost access to the datastore, which caused the virtual machine lock to fail.

       The corresponding entries in /var/log/vmkernel.log are listed under Cause.

Environment

vSphere ESXi

Cause

The host lost its heartbeat on the device because of an ATS miscompare and could not reclaim the heartbeat within the allowed time frame. VMFS uses on-disk locks to synchronize access to shared datastores, and it uses ATS (Atomic Test and Set) to update (lock/unlock) these on-disk locks. In this case, the on-disk images of the VMFS metadata (locks and heartbeats) change unexpectedly. Because the images revert to an earlier state, ATS operations on these locks and heartbeats fail with ATS miscompare errors. If the host loses its ATS heartbeat and cannot reclaim it on the datastore even after 16 seconds, all outstanding I/Os are aborted, which can crash the running VMs.
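
The miscompare mechanism described above can be illustrated with a toy model. This is only a sketch of generic compare-and-write semantics, not VMFS code: the `Block` class, the `ats` function, and the `gen=...` byte strings are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one on-disk heartbeat slot."""
    image: bytes


def ats(block: Block, test: bytes, set_: bytes) -> bool:
    """Compare-and-write: store set_ only if the block still holds test."""
    if block.image != test:
        return False  # miscompare: the on-disk image is not what the host expected
    block.image = set_
    return True


slot = Block(image=b"gen=100")

# Normal heartbeat update: the host's expected image matches the disk.
ok = ats(slot, test=b"gen=100", set_=b"gen=101")

# The on-disk image unexpectedly reverts to an older version.
slot.image = b"gen=100"

# The host still expects gen=101, so its next update miscompares.
failed = not ats(slot, test=b"gen=101", set_=b"gen=102")

print(ok, failed)
```

In this model the second update fails even though the host did nothing wrong, which mirrors the situation in this article: the stored image changed underneath the host, so the fault has to be investigated on the storage side.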

In /var/log/vmkernel.log, you will see events similar to the following:

2025-04-04T06:20:11.571Z cpu0:2098023)WARNING: HBX: 318: This host lost connectivity to volume 638ec6ee-71957c38-4a69-a0f479a85f1f ("KBZDR_XXXXXXX") and subsequent recovery attempts have failed
2025-04-04T06:20:11.571Z cpu0:2098023)HBX: 6112: 'KBZXXXXXXXXXXXX': HB at offset 3211264 - Lost heartbeat. On-disk:

2025-04-04T06:06:53.555Z cpu1:2097218)ScsiDeviceIO: 4115: Cmd(0x45d99d739c48) 0xfe, CmdSN 0xfe from world 2097209 to dev "naa.600a098038304xxxxxxxxxxx" failed H:0x0 D:0x2 P:0x5
2025-04-04T06:09:38.558Z cpu25:2097247)ScsiDeviceIO: 4115: Cmd(0x45d99d71c148) 0xfe, CmdSN 0xcc from world 2097209 to dev "naa.600a09803830xxxxxxxxxxx" failed H:0x0 D:0x2 P:0x5
2025-04-04T06:20:11.570Z cpu24:2097242)HBX: 1949: ATS Miscompare detected between test and set HB images at offset 3211264 on vol '638ec6ee-71957c38-4a69-00########09'.
2025-04-04T06:20:11.570Z cpu24:2097242)HBX: 1950: 'KBZDR_KBZDR_XXXXXXX': HB at offset 3211264 - Test version:
2025-04-04T06:20:11.570Z cpu24:2097242)HBX: 1951: 'KBZDR_KBZDR_XXXXXXX': HB at offset 3211264 - Set version:
2025-04-04T06:20:11.571Z cpu0:2098023)HBX: 5828: 'KBZDR_ESX_KBZDR_XXXXXXX': HB at offset 3211264 - Cancelling all threads waiting for reclaim of HB:
2025-04-04T06:20:11.571Z cpu0:2098023)  [HB state abcdef02 offset 3211264 gen 102993 stampUS 25651094154711 uuid 66681611-84fca561-eb3a-00########09 jrnl <FB 50331665> drv 24.82 lockImpl 4 ip 10.11.126.16]
2025-04-04T06:20:11.571Z cpu0:2098023)WARNING: HBX: 318: This host lost connectivity to volume 638ec6ee-71957c38-4a69-00########09 ("KBXXXXXXXXX") and subsequent recovery attempts have failed
2025-04-04T06:20:11.571Z cpu0:2098023)HBX: 6112: 'XXXXXXXXXX': HB at offset 3211264 - Lost heartbeat. On-disk:
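
To check whether a log contains the same signature, the events above can be counted with a small script. The following is a minimal sketch, assuming the message text matches the excerpt in this article; the category names and the `summarize` helper are hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Patterns are derived from the vmkernel.log excerpt in this article.
# The category names are made up for this sketch.
PATTERNS = {
    "ats_miscompare": re.compile(r"HBX: \d+: ATS Miscompare detected"),
    "lost_heartbeat": re.compile(r"HBX: \d+: .*Lost heartbeat"),
    "lost_connectivity": re.compile(
        r"HBX: \d+: This host lost connectivity to volume"
    ),
    "scsi_failure": re.compile(
        r"ScsiDeviceIO: \d+: .* failed "
        r"H:0x[0-9a-f]+ D:0x[0-9a-f]+ P:0x[0-9a-f]+"
    ),
}


def summarize(lines):
    """Count how many lines match each event category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = """
2025-04-04T06:06:53.555Z cpu1:2097218)ScsiDeviceIO: 4115: Cmd(0x45d99d739c48) 0xfe, CmdSN 0xfe from world 2097209 to dev "naa.600a098038304xxxxxxxxxxx" failed H:0x0 D:0x2 P:0x5
2025-04-04T06:20:11.570Z cpu24:2097242)HBX: 1949: ATS Miscompare detected between test and set HB images at offset 3211264 on vol '638ec6ee-71957c38-4a69-00########09'.
2025-04-04T06:20:11.571Z cpu0:2098023)WARNING: HBX: 318: This host lost connectivity to volume 638ec6ee-71957c38-4a69-00########09 ("KBXXXXXXXXX") and subsequent recovery attempts have failed
2025-04-04T06:20:11.571Z cpu0:2098023)HBX: 6112: 'XXXXXXXXXX': HB at offset 3211264 - Lost heartbeat. On-disk:
""".strip().splitlines()

result = summarize(sample)
for name, count in sorted(result.items()):
    print(f"{name}: {count}")
```

To scan a real log, pass an open file object for /var/log/vmkernel.log to `summarize` instead of `sample`. A match for all four categories close together in time is consistent with the pattern described in this article, but it does not replace the array vendor's analysis.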

 

Resolution

Contact the storage array vendor to investigate the ATS I/O failure.