VMFS datastore unmounted after ESXi reboot due to corruption

Article ID: 393126

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms 

  • If the affected datastore is shared with other hosts, its files may not be accessible or browsable.

      Path: vSphere --> Datastore --> Files

Validation Steps: 

  • esxcli storage vmfs extent list does not list the affected datastore, because it was not mounted after the reboot.
  • partedUtil getptbl /vmfs/devices/disks/naa.XXXX reports that the primary GPT partition table is corrupt (overwritten) and falls back to the backup table:

         [root@##-vmhost##:/vmfs/volumes/5e9fccf2-####-####-#######] partedUtil getptbl /vmfs/devices/disks/naa.6000d310059########

         The primary GPT table is corrupt, but the backup appears OK, so that will be used. Fix primary table ? diskPath (/dev/disks/naa.################################) diskSize (5368709120) AlternateLBA  (1) LastUsableLBA (5368709120)

         gpt
         334186 255 63 5368709120
         1 2048 5368709086 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

  • If the datastore is shared, the following errors appear in the vmkernel log on every ESXi host that accesses it.

    Log path: /var/run/log/vmkernel.log (for example: less /var/run/log/vmkernel.log)

    2025-03-30T23:26:44.855Z cpu12:3055997)WARNING: FS3J: 3480: Committing transaction failed: Timeout 
    2025-03-30T23:26:44.916Z cpu4:3055998)HBX: 3063: 'ME-#####-###-17': HB at offset 3637248 - Waiting for timed out HB:
    2025-03-30T23:26:44.916Z cpu4:3055998)  [HB state ####### offset 3637248 gen 769635 stampUS 2016038403857 uuid 67c95c0e-#####-####-####868621 jrnl <FB 33554433> drv 24.82 lockImpl 4 ip 10.33.5.18]
    2025-03-30T23:26:44.992Z cpu0:2101122 opID=f917fd43)World: 12077: VC opID HB-SpecSync-host-######-62d452a1-b7-#### maps to vmkernel opID f917fd43
    2025-03-30T23:26:44.992Z cpu0:2101122 opID=f917fd43)SchedVsi: 2083: Group: host/user/pool0(42162): min=132487 max=unlimited minLimit=unlimited shares=11908, units: mb
    2025-03-30T23:26:54.758Z cpu12:2277911)HBX: 3063: 'ME-####-###-17': HB at offset 3637248 - Waiting for timed out HB:
    2025-03-30T23:26:54.758Z cpu12:2277911)  [HB state abcdef02 offset 3637248 gen 769635 stampUS 2016038403857 uuid 67c95c0e-#####-###-#####68621 jrnl <FB 33554433> drv 24.82 lockImpl 4 ip 10.33.5.18]
    2025-03-30T23:27:00.015Z cpu14:3055997)HBX: 3063: 'ME-####-##-17': HB at offset 3637248 - Waiting for timed out HB:
    2025-03-30T23:27:00.015Z cpu16:3055996)HBX: 3063: 'ME-###-##-17': HB at offset 3637248 - Waiting for timed out HB:
    2025-03-30T23:27:00.015Z cpu14:3055997)  [HB state abcdef02 offset 3637248 gen 769635 stampUS 2016038403857 uuid 67c95c0e-###-####-####8868621 jrnl <FB 33554433> drv 24.82 lockImpl 4 ip 10.33.5.18]
    2025-03-30T23:27:00.015Z cpu16:3055996)  [HB state abcdef02 offset 3637248 gen 769635 stampUS 2016038403857 uuid 67c95c0e-#####-####-######## jrnl <FB 33554433> drv 24.82 lockImpl 4 ip 10.33.5.18]
    2025-03-30T23:27:01.557Z cpu2:2097706)NMP: nmp_ResetDeviceLogThrottling:3784: last error status from device naa.6000d3100595##############repeated 1 times
    2025-03-30T23:27:04.759Z cpu12:2277911)HBX: 3063: 'ME-####-###-17': HB at offset 3637248 - Waiting for timed out HB:
    2025-03-30T23:27:04.759Z cpu12:2277911)  [HB state abcdef02 offset 3637248 gen 769635 stampUS 2016038403857 uuid 67c95c0e-#####-###-####### jrnl <FB 33554433> drv 24.82 lockImpl 4 ip 10.##.##

    A heartbeat (HB) timeout can occur if the heartbeat is delayed or missed, for example because of storage subsystem issues or resource contention.
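The two command-line checks above can also be applied to saved output, for example when comparing several hosts. The following Python sketch is a hypothetical helper, not a VMware tool. It assumes you have captured the partedUtil getptbl output and the relevant vmkernel.log lines as text, and that both follow the layouts shown in the examples above. parse_getptbl flags the "primary GPT table is corrupt" warning and splits out the label, geometry and partition lines. scan_vmkernel_log counts the HBX heartbeat-timeout messages per volume label and offset, plus the FS3J commit timeouts.

```python
import re
from collections import Counter

# Warning text printed by partedUtil in the example above.
CORRUPT_PRIMARY = "The primary GPT table is corrupt"

# Patterns for the vmkernel.log messages shown above (assumed stable wording).
HB_RE = re.compile(r"HBX: \d+: '([^']+)': HB at offset (\d+) - Waiting for timed out HB")
COMMIT_RE = re.compile(r"FS3J: \d+: Committing transaction failed: Timeout")


def parse_getptbl(output):
    """Split saved `partedUtil getptbl` text into its parts (hypothetical helper).

    Assumed layout: optional warning text, a label line ("gpt"), a geometry line
    (cylinders heads sectors total_sectors), then one line per partition:
    number start_sector end_sector type_guid type_name attribute.
    """
    result = {
        "primary_corrupt": CORRUPT_PRIMARY in output,
        "label": None,
        "geometry": None,
        "partitions": [],
    }
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if result["label"] is None and len(fields) == 1 and fields[0].isalpha():
            # A single word on its own line is taken as the label type.
            result["label"] = fields[0]
        elif (result["label"] and result["geometry"] is None
              and len(fields) == 4 and all(f.isdigit() for f in fields)):
            # First all-numeric 4-field line after the label is the geometry.
            result["geometry"] = tuple(int(f) for f in fields)
        elif result["geometry"] and len(fields) == 6 and fields[0].isdigit():
            number, start, end, type_guid, type_name, attr = fields
            result["partitions"].append({
                "number": int(number),
                "start": int(start),
                "end": int(end),
                "type_guid": type_guid,
                "type": type_name,
                "attr": attr,
            })
    return result


def scan_vmkernel_log(lines):
    """Count HB timeouts per (volume label, offset) and FS3J commit timeouts."""
    heartbeats = Counter()
    commit_timeouts = 0
    for line in lines:
        match = HB_RE.search(line)
        if match:
            heartbeats[(match.group(1), int(match.group(2)))] += 1
        elif COMMIT_RE.search(line):
            commit_timeouts += 1
    return heartbeats, commit_timeouts
```

Because a file object iterates line by line, scan_vmkernel_log(open("vmkernel.log")) works directly on a saved copy of the log; repeated messages for one label and offset, like the offset 3637248 entries above, collapse into a single counter entry.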

Cause

  • Heartbeat timeouts at a particular on-disk offset led to corruption on the LUN, and the primary partition table was overwritten.

Cause Validation

  • Run a VOMA check as described in KB 318894:

    [root@##-vmhost##:/vmfs/volumes/5e9fccf2-####-####-#######] voma -m vmfs -f check -d /vmfs/devices/disks/naa.6000########################
    Running VMFS Checker version 2.1 in check mode
    Initializing LVM metadata, Basic Checks will be done
    ERROR: IO failed: Connection timed out
    Initializing LVM metadata .. \
    LVM magic not found at expected Offset,
    It might take long time to search in rest of the disk.
    VMware ESXi Question:
    Do you want to continue (Y/N) ?
    Yes
    No
    Select a number from 0-1: 0
    Searching the device for LVM Magic ..

    VOMA is unable to find the LVM magic at the expected offset, which is consistent with the metadata having been overwritten.
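When VOMA output has been saved from several devices or hosts, a small heuristic can look for the two messages shown above. This Python sketch is a hypothetical helper, not part of VOMA: the marker strings are copied from this example, and other VOMA versions may word their messages differently.

```python
# Marker strings copied from the VOMA output shown above; other VOMA
# versions may word these messages differently (assumption).
VOMA_MARKERS = {
    "io_timeout": "ERROR: IO failed: Connection timed out",
    "lvm_magic_missing": "LVM magic not found at expected Offset",
}


def classify_voma_output(text):
    """Return the sorted names of the markers present in saved VOMA output."""
    return sorted(name for name, marker in VOMA_MARKERS.items() if marker in text)


def lvm_metadata_suspect(text):
    """Heuristic: True when VOMA reported the LVM magic missing at the expected offset."""
    return "lvm_magic_missing" in classify_voma_output(text)
```

The check is plain substring matching, so it only confirms that VOMA printed these messages; the VOMA run itself remains the authoritative result.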

Resolution

If the VMFS metadata has been overwritten, do one of the following: