Virtual machines fail to power on with a "Bad address" error after rebooting the ESX host

Article ID: 342588

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

  • After rebooting the ESX host, some virtual machines fail to power on and the VMDK files do not open
  • You see the error:

    [2007-06-03 17:58:58.203 'vm:/vmfs/volumes/459bc363-204e26e4-a2bf-0018717988b8/My_Test_VM/My_Test_VM.vmx' 7977904 info] Question info: Cannot open the disk '/vmfs/volumes/459bc363-204e26e4-a2bf-0018717988b8/My_Test_VM/My_Test_VM-000002.vmdk' or one of the snapshot disks it depends on. Reason: Bad address., Id: 0 : Type : 2, Default: 0, Number of options: 1

    Note: You see this error in the VMware Infrastructure (VI) Client and in /var/log/vmware/hostd-0.log when you try to power on the virtual machines, and also when you try to clone a virtual disk using vmkfstools -i.

  • /var/log/vmkernel contains the following entry, which identifies an inaccessible device:

    Jun 3 21:04:08 my_esx_svr1 vmkernel: 0:00:14:38.197 cpu2:1034)ALERT: LVM: 1355: One or more devices not found (file system [My_VMFS_storage, 459bc363-204e26e4-a2bf-0018717988b8])
  • The output of the vmkfstools -P <datastore> command shows one or more extents are missing from the datastore:

    # vmkfstools -P /vmfs/volumes/My_VMFS_Storage
    VMFS-3.21 file system spanning 1 partitions.
    File system label (if any): My_VMFS_Storage
    Mode: public
    Capacity 1174136684544 (1119744 file blocks * 1048576), 73270296576 (69876 blocks) avail
    UUID: 459bc363-204e26e4-a2bf-0018717988b8
    Partitions spanned:
    vmhba0:0:1:1
    (One or more partitions spanned by this volume may be offline)


Environment

VMware VirtualCenter 1.4.x
VMware ESX Server 3.0.x
VMware VirtualCenter 2.0.x
VMware ESX Server 2.5.x

Resolution

This issue can occur if a VMFS volume is spanned over multiple LUNs and one or more of the extents is missing or corrupted.
In this scenario, if some LUNs go offline and not all of them come back online, the ESX host's /var/log/vmkernel log file identifies the VMFS volumes that are missing devices. For example:
Jun 3 21:38:35 my_esx_svr1 vmkernel: 0:00:01:04.626 cpu3:1034)ALERT: LVM: 1355: One or more devices not found (file system [My_VMFS_Storage, 459bc363-204e26e4-a2bf-0018717988b8])
My_VMFS_Storage is the VMFS datastore that is missing extents. Any virtual disk that is partially contained on a missing extent is inconsistent, and ESX hosts cannot open inconsistent virtual disk files. When you try to power on the virtual machine, you see the error:
[2007-06-03 17:58:58.203 'vm:/vmfs/volumes/459bc363-204e26e4-a2bf-0018717988b8/My_Test_VM/My_Test_VM.vmx' 7977904 info] Question info: Cannot open the disk '/vmfs/volumes/459bc363-204e26e4-a2bf-0018717988b8/My_Test_VM/My_Test_VM-000002.vmdk' or one of the snapshot disks it depends on. Reason: Bad address., Id: 0 : Type : 2, Default: 0, Number of options: 1
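To locate these LVM alerts on the host, you can search the vmkernel log directly (an illustrative command):
# grep "LVM:" /var/log/vmkernel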
To check whether a VMFS volume has all of its extents, run the command:
# vmkfstools -P /vmfs/volumes/My_VMFS_Storage
The output appears similar to:
VMFS-3.21 file system spanning 1 partitions.
File system label (if any): My_VMFS_Storage
Mode: public
Capacity 1174136684544 (1119744 file blocks * 1048576), 73270296576 (69876 blocks) avail
UUID: 459bc363-204e26e4-a2bf-0018717988b8
Partitions spanned:
vmhba0:0:1:1
(One or more partitions spanned by this volume may be offline)
The warning on the last line indicates that one or more of the partitions (extents) spanned by the VMFS volume My_VMFS_Storage are missing or offline. The extents can be VMFS partitions on this same LUN or on other LUNs.
If vmkfstools -P shows that any extents are missing, do not reboot the host or perform a storage rescan. Rebooting a host that accesses the missing, inaccessible, or corrupted VMFS extents removes the virtual machines running on those extents from the host's cache. As long as the ESX host has not rebooted and the SAN has not been rescanned, VMware Technical Support may be able to recover the data within the VMFS extent(s). VMware recommends that you migrate all working virtual machines away from the affected volume.
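For virtual machines whose disks are not affected, one way to copy a virtual disk to another datastore from the service console is to clone it with vmkfstools -i (the Good_VM and Other_Datastore paths below are illustrative; power the virtual machine off first):
# vmkfstools -i /vmfs/volumes/My_VMFS_Storage/Good_VM/Good_VM.vmdk /vmfs/volumes/Other_Datastore/Good_VM/Good_VM.vmdk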
You must collect a dump of the first 32 MB of the applicable device to give to VMware Technical Support.
To collect a dump:
  1. Run the following command to identify the mapping of the vmhba name and the /dev node:

    esxcfg-vmhbadevs -q
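
    Output similar to the following appears (the vmhba names and /dev nodes shown here are illustrative; yours will differ):

    vmhba0:0:1 /dev/sda
    vmhba1:0:2 /dev/sdb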

    Note: In this example, the device is /dev/sda.
  2. Create the metadata dump from the main LUN with the commands (bs=1k count=32768 copies the first 32 MB of the raw device):

    dd if=/dev/sda of=/tmp/sda.bin bs=1k count=32768
    md5sum /tmp/sda.bin > /tmp/sda.bin.md5sum

  3. Collect the metadata dumps from the VMFS partition(s) on this LUN with the commands:

    dd if=/dev/sda1 of=/tmp/sda1.bin bs=1k count=32768
    md5sum /tmp/sda1.bin > /tmp/sda1.bin.md5sum

    dd if=/dev/sda2 of=/tmp/sda2.bin bs=1k count=32768
    md5sum /tmp/sda2.bin > /tmp/sda2.bin.md5sum

    These commands create binary dump image files and md5sum signatures from the LUN and any VMFS partitions on it. Collecting the md5sum signatures is important to verify the integrity of the binary dump files after the SFTP/FTP transfer to VMware Technical Support.
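
    If the LUN contains several VMFS partitions, a short service console loop produces the same dumps and signatures in one pass (a sketch, assuming the /dev/sda1 and /dev/sda2 names from the example above):

    for part in sda1 sda2; do
        dd if=/dev/$part of=/tmp/$part.bin bs=1k count=32768
        md5sum /tmp/$part.bin > /tmp/$part.bin.md5sum
    done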
  4. Provide this information to VMware Technical Support. For more information, see Uploading diagnostic information to VMware (1008525).
  5. If the virtual machines are accessible via RDP, try using VMware Converter to move them to another datastore. VMware Technical Support may also be able to run backup software within the virtual machines to copy data off.
To prevent this issue, implement two-member zoning on the fabric switch so that you do not receive LIP/RSCN resets when a host accessing shared LUNs is rebooted; excessive SCSI bus resets can cause LUN corruption. A two-member zone contains a single initiator and a single target, so a single ESX host with four paths to the array is a member of four zones, one per initiator/target pair.
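For example, on a Brocade fabric switch, a single-initiator/single-target zone might be defined as follows (the zone name, configuration name, and WWPNs are hypothetical, and the syntax varies by switch vendor):
zonecreate "z_esx1_hba0_spa0", "10:00:00:00:c9:11:22:33; 50:06:01:60:44:55:66:77"
cfgadd "fabric_cfg", "z_esx1_hba0_spa0"
cfgenable "fabric_cfg"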