VMFS Resignature causes thrashing between multiple VMware ESXi 4.x/5.x and ESX 4.x hosts

Article ID: 311337

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

After a LUN ID presentation change, a datastore becomes inaccessible or intermittently inaccessible.
You have a VMware vSphere cluster consisting of two or more VMware ESXi/ESX hosts.
A LUN presentation change made to one or more VMware ESX hosts in the cluster either:
- Caused non-uniform presentation (a host/initiator has unique presentation compared to its peers).
- Corrected a non-uniform presentation (LUN IDs have been made consistent across hosts/initiators).
Your SAN LUNs do not support Network Address Authority, or it is disabled, relying upon LUN IDs and serial numbers.
In VMware vCenter, the datastore appears for several seconds at a time before disappearing and re-appearing.
The datastore name includes a prefix, in the form snap-<UUID>-<Datastore Name>. The prefix changes each time the datastore reappears.
The virtual machines on the datastore are inaccessible, cannot be powered on, or have failed.
The Advanced Setting LVM.enableResignature is set to 1 on more than one VMware ESX host.

Environment

VMware ESXi 4.1.x Installable
VMware vSphere ESXi 5.0
VMware ESXi 4.0.x Installable
VMware ESXi 4.0.x Embedded
VMware ESX 4.0.x
VMware ESXi 4.1.x Embedded
VMware ESX 4.1.x

Resolution

This issue occurs only when all of these conditions are true:

SAN array support for LUN Network Address Authority (NAA) is disabled or absent.
LUN presentation, or host-side device registration, is non-uniform across the initiators or hosts.
LVM.enableResignature is enabled on two or more VMware ESX hosts.

This results in two or more ESX servers repeatedly re-discovering and resignaturing a datastore.

Network Address Authority (NAA) vs. LUN IDs

This behavior is prevented by the support and use of NAA for SAN LUNs with VMware ESX 3.5 Update 5 and later. Support for NAA IDs was introduced with ESX 3.5 GA, but ESX continued to reference devices by LUN ID until Update 5.

If the SAN array or LUNs do not support NAA, the SAN presentation of LUN IDs must be uniform across all hosts/initiators. Versions of VMware ESX prior to 3.5 Update 5 therefore require uniform LUN IDs and storage presentation. For more information, see VMFS resignaturing in the Additional Information section.

Resignature thrashing due to non-uniform presentation

With non-uniform presentation, resignature thrashing occurs if:

LUN NAA is not supported or not enabled.
LUN IDs are not uniform across all hosts in the cluster. For example:

Datastore A is presented to Host A as the LUN ID 1 and as LUN ID 15 to Host B.
A rescan task is requested on Host B, which has LVM.enableResignature set to 1. Upon discovering the LUN during the rescan, this host determines that the datastore's current presentation does not match the LUN ID written on disk, and deems a resignature necessary. As configured, the host resignatures the VMFS datastore as LUN 15.
In VMware vSphere 4.x, this prompts the remainder of the cluster to rescan for storage devices. A rescan task is propagated to the other hosts so that all of them are updated on the current storage presentation.
However, Host A still has the datastore presented as LUN ID 1. When Host A discovers that the device has an on-disk signature of LUN ID 15, it deems a resignature necessary and writes an on-disk LUN ID of 1 again. This again prompts the remainder of the cluster to rescan, including Host B.
Host B rescans its storage and the process repeats between Host A and Host B until someone intervenes manually.
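
The sequence above can be sketched as a small simulation. This is a simplified model for illustration only, not VMware's implementation: the Host class, the simulate_rescans function, and the max_rescans cut-off are hypothetical names, and the model assumes a host resignatures whenever the on-disk LUN ID differs from the ID it is presented with.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    presented_lun_id: int     # LUN ID under which this host sees the device
    enable_resignature: bool  # models LVM.enableResignature set to 1


def simulate_rescans(hosts, on_disk_lun_id, max_rescans=6):
    """Return the resignature events as (host name, new on-disk LUN ID) pairs.

    Assumption: every resignature prompts all other hosts to rescan, and the
    first rescan is requested on the last host in the list.
    """
    events = []
    pending = [hosts[-1]]
    rescans = 0
    while pending and rescans < max_rescans:
        host = pending.pop(0)
        rescans += 1
        if host.enable_resignature and host.presented_lun_id != on_disk_lun_id:
            # The host rewrites the on-disk ID to match its own presentation.
            on_disk_lun_id = host.presented_lun_id
            events.append((host.name, on_disk_lun_id))
            # The resignature prompts the rest of the cluster to rescan.
            pending = [h for h in hosts if h is not host]
    return events


if __name__ == "__main__":
    cluster = [Host("Host A", 1, True), Host("Host B", 15, True)]
    for name, lun_id in simulate_rescans(cluster, on_disk_lun_id=1):
        print(f"{name} resignatures the datastore as LUN {lun_id}")
```

With Host A presented as LUN 1 and Host B as LUN 15, both with the flag set, the events alternate between Host B and Host A until max_rescans is reached. In this model, clearing the flag on either host ends the sequence after at most one resignature, which is consistent with the condition that the flag must be enabled on two or more hosts.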

Resignature thrashing occurring despite corrected presentation

Depending on how the non-uniform presentation has been corrected, resignature thrashing can be triggered if:

LUN IDs were not uniform across all hosts in the cluster. For example, "Datastore A" is presented to Host A as the LUN ID 1 and as LUN ID 15 to Host B.
LUN IDs were made consistent for Host B, ensuring that the device is presented as LUN 1, instead of 15. This was, however, performed without removing the device first:
1. Modify presentation of LUN 15 and change its ID from 15 to 1.
2. Rescan Host B.
3. The /var/log/vmkernel log contains an entry similar to:
  
  WARNING: LinSCSI: 4371: The physical media represented by vmhba1:0:1 has changed and the device is in use. The device cannot be re-synchronized with the system. This is a critical error.
4. The LUN was in use by one or more components on VMware ESX, so it could not be unmapped and re-mapped in the host's in-memory data structures. Reboot the host to re-map all devices and correct this condition. See the Additional Information section for the best practice and supported steps for LUN re-presentation on VMware ESX.
After the presentation changes above, a rescan task is manually requested on Host B, which has LVM.enableResignature set to 1.

Note: At this stage, Host B was unable to re-synchronize the device correctly, as it was seemingly lost or replaced by a different device prior to its rescan.

Upon discovering the LUN during the rescan, this host determines that the datastore's current presentation (still seen as LUN ID 15, because the device was not re-synchronized) does not match the on-disk LUN ID of 1 (written by Host A), and deems a resignature necessary.

Host B resignatures the VMFS datastore as LUN 15, which it perceives to be the current LUN ID.
By design in VMware vSphere 4.x, this prompts the other servers to rescan for storage devices. A rescan task is propagated to the remainder of the cluster so that all hosts are updated on the current storage presentation.
Host A has the datastore correctly presented as LUN ID 1, but discovers the device again with an on-disk LUN ID of 15, indicating that the device needs to be resignatured.
Host A completes another resignature to write an on-disk ID of 1 again. This prompts the remainder of the cluster to rescan, including Host B.
Host B rescans its storage and the process repeats between Host A and Host B until someone intervenes manually.

Stopping the resignature loop

Stop the resignaturing loop condition as soon as possible to prevent interruption in your environment.

To stop the resignaturing loop:

Log in to the ESX host's terminal directly or through SSH.
- For ESX hosts, press Alt+F1 in its console or your remote System Management Interface and log in as root. See Connecting to an ESX host using a SSH client (1019852).
- For ESXi hosts, see Using Tech Support Mode in ESXi 4.1 (1017910).
Run esxcfg-advcfg -s 0 /LVM/EnableResignature to disable resignaturing.
Repeat this process on every remaining host in the cluster.
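
Because the command has to be run on each host, it can help to prepare the per-host invocations in advance. The sketch below only builds the command lines and executes nothing. It assumes SSH access as root is enabled on each host; the build_disable_commands function and the host names are hypothetical.

```python
import shlex

# Command taken from the procedure above; it disables automatic resignaturing.
DISABLE_CMD = "esxcfg-advcfg -s 0 /LVM/EnableResignature"


def build_disable_commands(hosts, user="root"):
    """Return one ssh argument list per host. Nothing is executed here."""
    return [["ssh", f"{user}@{host}", DISABLE_CMD] for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for argv in build_disable_commands(["esx-a.example.com", "esx-b.example.com"]):
        # Print a shell-quoted line that can be reviewed before it is run.
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the operator in control during an outage; each line can be reviewed and pasted into a terminal one host at a time.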

Closing recommendations

After stopping the resignature thrashing, correct the underlying triggers:

Verify your SAN LUN presentation and ensure it is uniform across all hosts/initiators in the cluster.
Verify whether your SAN LUNs support a Namespace Addressing scheme (such as NAA or EUI) and enable one, if possible.
Conform to your SAN array vendor's best practices and documentation for LUN presentation or Namespace Addressing changes. If you have any questions or issues, contact your SAN array vendor for assistance.
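
The uniformity check in the first recommendation can be sketched as follows. This is a minimal illustration that assumes you have already collected, for each host, a mapping of device serial number to LUN ID (for example from your array's management tools). The find_non_uniform_luns function and the sample serial numbers are hypothetical.

```python
def find_non_uniform_luns(presentation):
    """Find devices whose LUN ID is not identical on every host.

    presentation maps host name -> {device serial: LUN ID}.
    Returns {device serial: {host name: LUN ID or None}}, where None means
    the device is not presented to that host at all.
    """
    serials = set()
    for luns in presentation.values():
        serials.update(luns)

    problems = {}
    for serial in sorted(serials):
        per_host = {host: luns.get(serial) for host, luns in presentation.items()}
        # More than one distinct value means the presentation is non-uniform.
        if len(set(per_host.values())) > 1:
            problems[serial] = per_host
    return problems


if __name__ == "__main__":
    sample = {
        "Host A": {"SER-001": 1, "SER-002": 2},
        "Host B": {"SER-001": 15, "SER-002": 2},
    }
    for serial, per_host in find_non_uniform_luns(sample).items():
        print(serial, per_host)
```

In the sample data only SER-001 is reported, because it is LUN 1 on Host A and LUN 15 on Host B. Treating a device that is missing from one host as non-uniform is a design choice of this sketch.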

Additional Information

VMFS resignaturing

When a VMFS datastore is formatted, a disk ID is written to the volume, identifying the device and datastore as original. If the VMFS datastore is snapshotted or cloned, the disk ID of the snapshot or clone differs from the on-disk signature that was originally recorded, because it is a different storage device.

VMware ESX relies upon Disk IDs to differentiate an original device from a snapshot or clone and does not mount snapshot devices without force-mounting or resignaturing the clone/snapshot VMFS datastore. For more information, see Snapshot LUN detection in ESX 3.x and ESX 4 (1011385).

However, this mechanism can be inadvertently triggered against original devices, as opposed to clone or snapshot LUNs, if the presentation of a LUN is not consistent between two ESX hosts or initiators, and the LUN does not support Namespace Addressing. One of the two servers will determine the device is a Snapshot LUN and not mount it.
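
The decision described in this section can be summarized in a few lines. This is a deliberately simplified model of an original device whose LUN ID presentation may differ between hosts; it does not model how ESX detects genuine snapshots or clones. The mount_decision function and its parameters are hypothetical names.

```python
def mount_decision(on_disk_lun_id, presented_lun_id,
                   naa_in_use=False, enable_resignature=False):
    """Return "mount", "resignature", or "skip" for an original device.

    Assumption: when NAA is in use the LUN ID is not used to identify the
    device, so a LUN ID mismatch alone does not make it look like a snapshot.
    """
    if naa_in_use:
        return "mount"
    if on_disk_lun_id == presented_lun_id:
        return "mount"
    if enable_resignature:
        # Models LVM.enableResignature set to 1: the host rewrites the signature.
        return "resignature"
    # The host treats the device as a snapshot LUN and does not mount it.
    return "skip"


if __name__ == "__main__":
    print(mount_decision(on_disk_lun_id=1, presented_lun_id=15))
    print(mount_decision(on_disk_lun_id=1, presented_lun_id=15,
                         enable_resignature=True))
```

The "resignature" branch is the one that feeds the thrashing loop when it is reachable on more than one host at the same time.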

LUN presentation changes on VMware ESX

Changing LUN presentation on any version of ESX requires this process:

Quiesce I/O to the device. Stop the virtual machines or vMotion them to another host, if only this server's presentation of the LUN is being changed.
After quiescing I/O to the device, see Removing a LUN containing a datastore from VMware ESXi/ESX 4.x (1029786) and unpresent the LUN from the ESX host.
Complete your presentation changes and present the LUN to the host. Ensure the LUN ID is consistent and/or namespace addressing is used.
Rescan the server and verify the LUN's size is correct, the datastore has mounted, and resignaturing or forced-mounting is not required.

SRM consideration

Site Recovery Manager (SRM) deployments configure ESX servers with LVM.enableResignature set to 1 when performing test failovers or failovers. If the SAN array does not support Namespace Addressing, or has not enabled it, and LUN presentation is non-uniform, this repeating condition is triggered when SRM performs a test failover or failover.

This is the only currently known automated cause of the LVM.enableResignature flag being enabled on one or more hosts.
Removing a LUN containing a datastore from VMware ESXi/ESX 4.0 and 4.1
VMFS resignature causes thrashing between multiple VMware ESXi 4.x/5.x hosts and ESX 4.x hosts (Japanese-language version of this article)
