VMs Entering Hung State Due to VMFS Datastore Corruption and Duplicate MAC Addresses on hosts

Article ID: 386947

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

  • Multiple VMs may frequently enter a hung or unresponsive state.
  • Attempts to power on virtual machines fail.
  • The issue is more prevalent on Cisco UCS servers that use the Cisco Systems Inc Cisco VIC Ethernet NIC.
 
Validation Steps: 
 
1. Determine whether the impacted VMs are on the same ESXi host or on different hosts.
  • In the vSphere Client, navigate to the VMs tab.
  • Select an affected VM and check the Summary page.
  • Note which ESXi host and datastore the VM resides on. If multiple VMs are experiencing similar issues, check whether they reside on the same host and datastore.
 
 
 
2. Confirm whether the datastore is shared by multiple ESXi hosts and whether the issue is observed on VMs that run on different hosts but use the same datastore.
  • In the vSphere Client, go to Storage > Datastores > Hosts.
  • Check whether the datastore is shared by multiple ESXi hosts and whether the symptoms (hung VMs) appear on VMs from different hosts that share that datastore.

 
3. Gather the NAA (Network Address Authority) ID of the affected datastore to further investigate storage-related issues.
 
  • In the vSphere Client, go to Storage > Datastores.
  • Select the datastore in question.
  • Navigate to Configure > Device Backing.
  • Note the NAA ID of the datastore. It is used to locate the device in the storage logs.
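
The check in steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the VM, host, and datastore names are hypothetical placeholders for the values noted from each VM's Summary page, and `shared_datastore_impact` is a helper name introduced here, not part of any VMware tool.

```python
from collections import defaultdict


def shared_datastore_impact(affected_vms):
    """Return datastores whose affected VMs run on more than one host.

    The result maps each such datastore to a sorted list of host names.
    """
    hosts_by_datastore = defaultdict(set)
    for vm in affected_vms:
        hosts_by_datastore[vm["datastore"]].add(vm["host"])
    return {
        datastore: sorted(hosts)
        for datastore, hosts in hosts_by_datastore.items()
        if len(hosts) > 1
    }


# Hypothetical inventory, as noted from the Summary page of each hung VM.
affected = [
    {"name": "vm-app01", "host": "esxi-01", "datastore": "DS-SHARED-01"},
    {"name": "vm-db01", "host": "esxi-02", "datastore": "DS-SHARED-01"},
    {"name": "vm-web01", "host": "esxi-01", "datastore": "DS-LOCAL-01"},
]

print(shared_datastore_impact(affected))
```

A datastore that appears in the result has hung VMs on several hosts, which points toward the shared datastore rather than a single host.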

Investigate the VMkernel logs to identify datastore-related issues that might be causing VM failures or hung states.
 
Log path: /var/run/log/vmkernel.log
 
Keywords to look for in the logs:
  • Lost previously held disk lock
  • Invalid metadata
  • Stale HB slot(s) owned by me have been garbage collected on vol
  • Optimistic lock acquired by another host
Sample log entries:
 
2024-12-15T22:04:22.130Z cpu70:2097806)BC: 414: write to vmware.log (f532 28 3 61791520 67246232 25007d9f 1e011cb5 ec00004 9541 0 0 0 0 0) 618445 bytes failed: Lost previously held disk lock
2024-11-27T12:10:24.387Z cpu29:2100987)WARNING: FS3J: 2246: Error freeing journal block <JB cnum 6 rnum 1> (returned 0) for 645cac24-########-bbc0-#########: Invalid metadata
2024-11-27T12:10:24.387Z cpu29:2100987)WARNING: HBX: 3826: Cannot free journal <type 6 addr 8388614> on vol 'XXXX-XXXXX-CXXXX-XXXX'
2024-11-27T12:10:24.389Z cpu36:2101501)HBX: 6554: '####-###-####-##': HB at offset 3178496 - Marking HB:
2024-11-27T12:10:24.389Z cpu36:2101501)  [HB state abcdef02 offset 3178496 gen 39823 stampUS 5464341924766 uuid ######-#####-####-####jrnl <FB 16777222> drv 24.82 lockImpl 4 ip 10.###.####.175]
2024-11-27T12:10:24.389Z cpu36:2101501)HBX: 6558: HB at 3178496 on vol '####-###-##-##' replayHostHB: 0 replayHostHBgen: 0 replayHostUUID:  (00000000-00000000-0000-000000000000).
2024-11-27T12:10:24.389Z cpu36:2101501)HBX: 6673: '#############': HB at offset 3178496 - Marked HB:
2024-11-27T12:10:24.389Z cpu36:2101501)  [HB state abcdef04 offset 3178496 gen 39823 stampUS 182604881 uuid 66f3abeb-########-#####-0025b5151a5d jrnl <FB 16777222> drv 24.82 lockImpl 4 ip 10.#####.####.175]
2024-11-27T12:10:24.389Z cpu36:2101501)FS3J: 4387: Replaying journal at <type 6 addr 16777222>, gen 39823
2024-11-27T12:10:24.400Z cpu36:2101501)HBX: 4726: 1 stale HB slot(s) owned by me have been garbage collected on vol '##################'
2024-11-27T12:10:24.402Z cpu36:2101501)WARNING: FS3: 608: VMFS volume ######################### on naa.######################:1 has been detected corrupted
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 610: While filing a PR, please report the names of all hosts that attach to this LUN, tests that were running on them,
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 634: and upload the dump by `voma -m vmfs -f dump -d /vmfs/devices/disks/naa.##########################:1 -D X`
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 641: where X is the dump file name on a DIFFERENT volume
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 374: FS3RCMetaVMFS6 0 0 1919118692 0 6 8 8 0 0
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 379: 0 0 0 0 1 0 0 0 0 0
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 384: 0 0 00000000-00000000-0000-000000000000
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 388: 34004 1732709393 0 37 181 21 26 93
2024-11-27T12:10:24.402Z cpu36:2101501)FS3: 395: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 
fol_06/tdlog/logs/vmkernel.all:2024-12-25T15:01:43.905Z cpu20:86485154)FS3J: 3239: Cancelling txn (0x43153d853200) callerID: 0xc1d00002 due to failurepre-committing: Optimistic lock acquired by another host
fol_06/tdlog/logs/vmkernel.all:2024-12-25T18:01:40.906Z cpu65:86603295)FS3J: 3239: Cancelling txn (0x43153e093400) callerID: 0xc1d00001 due to failurepre-committing: Optimistic lock acquired by another host
fol_06/tdlog/logs/vmkernel.all:2024-12-31T00:21:48.231Z cpu103:2097413)FS3J: 3239: Cancelling txn (0x43153e800c00) callerID: 0xc1d00002 due to failurepre-committing: Optimistic lock acquired by another host
fol_06/tdlog/logs/vmkernel.all:2025-01-03T06:46:06.803Z cpu19:94524201)FS3J: 3239: Cancelling txn (0x43153cccfa00) callerID: 0xc1d00002 due to failurepre-committing: Optimistic lock acquired by another host
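
As a rough aid for the keyword search, the following Python sketch counts how many log lines contain each of the four keywords. It is a minimal illustration, not a VMware utility: `count_keywords` is a name introduced here, and the sample lines are abbreviated versions of the entries shown above.

```python
from collections import Counter

# The four keywords from this article. Matching is case-insensitive because
# the log text can differ in capitalisation (for example "1 stale HB slot(s)").
KEYWORDS = (
    "Lost previously held disk lock",
    "Invalid metadata",
    "Stale HB slot(s) owned by me have been garbage collected on vol",
    "Optimistic lock acquired by another host",
)


def count_keywords(lines):
    """Count how many lines contain each keyword."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword.lower() in lowered:
                counts[keyword] += 1
    return counts


# Abbreviated sample lines modelled on the excerpt above (identifiers masked).
sample = [
    "2024-12-15T22:04:22.130Z cpu70:2097806)BC: 414: write to vmware.log 618445 bytes failed: Lost previously held disk lock",
    "2024-11-27T12:10:24.387Z cpu29:2100987)WARNING: FS3J: 2246: Error freeing journal block: Invalid metadata",
    "2024-11-27T12:10:24.400Z cpu36:2101501)HBX: 4726: 1 stale HB slot(s) owned by me have been garbage collected on vol '####'",
    "2024-12-25T15:01:43.905Z cpu20:86485154)FS3J: 3239: Cancelling txn: Optimistic lock acquired by another host",
    "2024-12-25T18:01:40.906Z cpu65:86603295)FS3J: 3239: Cancelling txn: Optimistic lock acquired by another host",
]

for keyword, hits in count_keywords(sample).items():
    print(f"{hits:3d}  {keyword}")

# The same function could be fed a copy of the real log, for example:
#   with open("/var/run/log/vmkernel.log", errors="replace") as fh:
#       count_keywords(fh)
```

Repeated hits for several of these keywords on hosts that share the datastore are consistent with the locking and metadata errors described in this article.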
 
 


Environment

VMware vSphere ESXi 7.0.x

VMware vSphere ESXi 8.0.x

Cause

The issue occurs due to the use of the Cisco VIC (Virtual Interface Card) Ethernet NIC, which presents network interfaces with the same MAC address to multiple ESXi hosts.

  • VMFS Design Consideration: VMFS5/6 does not expect network adapters with identical MAC addresses to be attached to more than one ESXi host within the same cluster that accesses a shared VMFS datastore.
  • When multiple ESXi servers in the cluster have NICs with duplicate MAC addresses, it can lead to conflicts affecting storage access, networking operations, or other cluster-related functionality.

Cause Validation 

Collect NIC Details from ESXi Hosts Connected to the Affected LUN


1. If the LUN is connected to multiple ESXi hosts (for example, 30 hosts), gather the NIC details by running the following command on each affected host:

    esxcfg-nics -l

ESXi host 1:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:1a:00.0 nenic       Up   50000Mbps  Full   ##:##:b5:15:1a:33  1500   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>>Same MAC Address 
vmnic1  0000:1a:00.1 nenic       Up   50000Mbps  Full   ##:##:##:15:1a:34  9000   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  0000:1a:00.2 nenic       Up   50000Mbps  Full   ##:##:##:15:1a:35  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3  0000:cc:00.0 nenic       Up   50000Mbps  Full   ##:##:##:15:1b:33  1500   Cisco Systems Inc Cisco VIC Ethernet NIC

ESXi host 2:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:62:00.0 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:31  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1  0000:62:00.1 nenic       Up   20000Mbps  Full   ##:##:##:##:1b:31  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  0000:62:00.2 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:32  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic3  0000:dc:00.0 nenic       Up   20000Mbps  Full   ##:##:##:##:1b:32  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic4  0000:dc:00.1 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:33  9000   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>>> Same MAC Address

2. Get the UUID of each ESXi host by running the following command, and confirm that the hosts have different UUIDs:

      localcli system uuid get
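
With many hosts, comparing the `esxcfg-nics -l` output by eye is error-prone. The Python sketch below shows one possible way to flag MAC addresses that appear on more than one host. It assumes the output of each host has been saved as text. The host names and MAC addresses are hypothetical sample values, and `parse_nics` and `duplicate_macs` are helper names introduced for this illustration.

```python
import re
from collections import defaultdict

# Six colon-separated hex pairs; PCI addresses such as 0000:1a:00.0 do not match.
MAC_RE = re.compile(r"\b[0-9a-f]{2}(?::[0-9a-f]{2}){5}\b", re.IGNORECASE)


def parse_nics(output):
    """Return {vmnic name: MAC address} from `esxcfg-nics -l` text."""
    nics = {}
    for line in output.splitlines():
        fields = line.split()
        match = MAC_RE.search(line)
        if fields and fields[0].startswith("vmnic") and match:
            nics[fields[0]] = match.group(0).lower()
    return nics


def duplicate_macs(outputs_by_host):
    """Return MAC addresses that are present on more than one host."""
    seen = defaultdict(list)
    for host, output in outputs_by_host.items():
        for nic, mac in parse_nics(output).items():
            seen[mac].append((host, nic))
    return {
        mac: owners
        for mac, owners in seen.items()
        if len({host for host, _ in owners}) > 1
    }


# Hypothetical sample output; real MAC addresses will differ.
HOST1 = """\
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:1a:00.0 nenic       Up   50000Mbps  Full   00:25:b5:00:1a:33  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic1  0000:1a:00.1 nenic       Up   50000Mbps  Full   00:25:b5:00:1a:34  9000   Cisco Systems Inc Cisco VIC Ethernet NIC
"""
HOST2 = """\
Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  0000:62:00.0 nenic       Up   20000Mbps  Full   00:25:b5:00:1a:31  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic4  0000:dc:00.1 nenic       Up   20000Mbps  Full   00:25:b5:00:1a:33  9000   Cisco Systems Inc Cisco VIC Ethernet NIC
"""

for mac, owners in duplicate_macs({"esxi-01": HOST1, "esxi-02": HOST2}).items():
    print(mac, owners)
```

An empty result would indicate that no MAC address is shared between the hosts that were compared.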

Resolution

If the same MAC address is found on different ESXi hosts, follow the steps below.

1. Enable the FollowHardwareMac advanced setting on the ESXi host.
 
FollowHardwareMac is an advanced ESXi setting that automatically updates the VMkernel's MAC address whenever the network adapter's MAC address changes. To enable it, run the following ESXCLI command:
 
esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1
 

2. After the setting is applied, reboot the hosts.
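
Before rebooting, it can be useful to confirm that the setting took effect on every host. The Python sketch below parses the text printed by `esxcli system settings advanced list -o /Net/FollowHardwareMac` and returns the current integer value. The sample text is an abbreviated, assumed rendering of that output, so the field layout should be checked against a real host. `int_value` is a helper name introduced here.

```python
def int_value(listing):
    """Return the current "Int Value" from an advanced-setting listing.

    Returns None when the field is not present.
    """
    for line in listing.splitlines():
        key, sep, value = line.strip().partition(":")
        # "Default Int Value" is a different key and is skipped here.
        if sep and key.strip() == "Int Value":
            return int(value.strip())
    return None


# Assumed, abbreviated sample of the listing after the setting is enabled.
SAMPLE = """\
   Path: /Net/FollowHardwareMac
   Type: integer
   Int Value: 1
   Default Int Value: 0
   Min Value: 0
   Max Value: 1
"""

print("FollowHardwareMac enabled:", int_value(SAMPLE) == 1)
```

A value other than 1 would suggest that the setting was not applied on that host.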

Output of esxcfg-nics -l before running esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1:

ESXi host 1:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  ####:##:00.0 nenic       Up   50000Mbps  Full   ##:##:##:15:1a:33  1500   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>>Same MAC Address 
vmnic1  ####:##:00.1 nenic       Up   50000Mbps  Full   ##:##:##:##:1a:34  9000   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  ####:##:00.2 nenic       Up   50000Mbps  Full   ##:##:##:##:1a:35  1500   Cisco Systems Inc Cisco VIC Ethernet NIC


ESXi host 2:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  ####:##:00.0 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:31  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic4  ####:##:00.1 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:33  9000   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>>> Same MAC Address
vmnic5  ####:##:00.2 nenic       Up   20000Mbps  Full   ##:##:##:##:1b:33  9000   Cisco Systems Inc Cisco VIC Ethernet NIC

Output of esxcfg-nics -l after running esxcli system settings advanced set -o /Net/FollowHardwareMac -i 1 and rebooting:

ESXi host 1:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  ####:##:00.0 nenic       Up   50000Mbps  Full   ##:##:##:##:1a:33  1500   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>> MAC Address unchanged on host 1
vmnic1  ####:##:00.1 nenic       Up   50000Mbps  Full   ##:##:##:##:1a:34  9000   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic2  ####:##:00.2 nenic       Up   50000Mbps  Full   ##:##:##:##:1a:35  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
 

ESXi host 2:

Name    PCI          Driver      Link Speed      Duplex MAC Address       MTU    Description
vmnic0  ####:##:00.0 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:31  1500   Cisco Systems Inc Cisco VIC Ethernet NIC
vmnic4  ####:##:00.1 nenic       Up   20000Mbps  Full   ##:##:##:##:1a:36  9000   Cisco Systems Inc Cisco VIC Ethernet NIC   >>>>>>>>>>> Different MAC Address
vmnic5  ####:##:00.2 nenic       Up   20000Mbps  Full   ##:##:##:##:1b:33  9000   Cisco Systems Inc Cisco VIC Ethernet NIC

When the issue occurs on Cisco UCS servers using Cisco Systems Inc. Cisco VIC Ethernet NIC, follow this additional step from Cisco:

  1. In Cisco UCS Manager, select the AUTO radio button to assign a new MAC address.