This issue occurs only when all of the following conditions are true:
- Disabled or no SAN array support for LUN Network Address Authority
- Non-uniform LUN presentation or non-uniform host-side device registration for the initiators or hosts.
- LVM.enableResignature is enabled on two or more VMware ESX hosts.
This results in two or more ESX servers repeatedly re-discovering and resignaturing a datastore.
Network Address Authority (NAA) vs. LUN IDs
This behavior is prevented by the support and use of NAA for SAN LUNs in VMware ESX 3.5 Update 5 and later. Support for NAA IDs was introduced in ESX 3.5 GA, but ESX continued to use the LUN ID to reference the device until Update 5.
If the SAN array or LUNs do not support NAA, the SAN presentation of LUN IDs must be uniform across all hosts/initiators. Versions of VMware ESX earlier than 3.5 Update 5 therefore require uniform LUN IDs and storage presentation. See VMFS resignaturing in the Additional Information section for more information.
Resignature thrashing due to non-uniform presentation
With non-uniform presentation, resignature thrashing occurs if:
- LUN NAA is not supported or enabled.
- LUN IDs are not uniform across all hosts in the cluster. For example:
Datastore A is presented to Host A as the LUN ID 1 and as LUN ID 15 to Host B.
- A rescan task is requested on Host B, which has LVM.enableResignature set to 1. Upon discovery of the LUN during rescan, this host determines that the datastore's current presentation does not match its written on-disk LUN ID, and deems a resignature is necessary. As configured, the host resignatures the VMFS datastore as LUN 15.
- In VMware vSphere 4.x, this propagates a rescan task to the remainder of the cluster, so that all hosts are updated on the current storage presentation.
- However, Host A still has the datastore presented as LUN ID 1. As Host A discovers the device has an on-disk signature of LUN ID 15, it deems it necessary to resignature, writing an on-disk LUN ID of 1 again. This yet again prompts the remainder of the cluster to rescan, including Host B.
- Host B rescans its storage, and the process repeats between Host A and Host B until someone intervenes manually.
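The sequence above can be illustrated with a small toy model. This is only a sketch of the logic described in this article, not of how ESX is implemented: the Host class, its field names, and the simulate function are hypothetical, and each "pass" stands in for one cluster-wide rescan.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    presented_lun_id: int      # LUN ID under which this host sees the datastore
    enable_resignature: bool   # models LVM.enableResignature set to 1


def simulate(hosts, on_disk_lun_id, max_rescans=10):
    """Return the resignature events as (host name, new on-disk LUN ID).

    Every host rescans in turn. A host whose presented LUN ID differs from
    the on-disk LUN ID resignatures the datastore if resignaturing is
    enabled. The loop stops when a full pass causes no resignature, or
    after max_rescans passes.
    """
    events = []
    for _ in range(max_rescans):
        changed = False
        for host in hosts:
            mismatch = host.presented_lun_id != on_disk_lun_id
            if mismatch and host.enable_resignature:
                on_disk_lun_id = host.presented_lun_id
                events.append((host.name, on_disk_lun_id))
                changed = True
        if not changed:
            break
    return events


# Datastore A: LUN ID 1 on Host A, LUN ID 15 on Host B, written as 1 on disk.
cluster = [Host("Host A", 1, True), Host("Host B", 15, True)]
for name, lun_id in simulate(cluster, on_disk_lun_id=1, max_rescans=3):
    print(f"{name} resignatures the datastore as LUN {lun_id}")
```

In this model the loop never settles while both hosts have resignaturing enabled and see different LUN IDs. It ends immediately if the LUN IDs are uniform, and after a single resignature if only one host has resignaturing enabled, which matches the conditions listed at the top of this article.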
Resignature thrashing occurring despite corrected presentation
Depending on how the non-uniform presentation has been corrected, resignature thrashing can be triggered if:
- LUN IDs were not uniform across all hosts in the cluster. For example, "Datastore A" is presented to Host A as the LUN ID 1 and as LUN ID 15 to Host B.
- LUN IDs were made consistent for Host B, ensuring that the device is presented as LUN 1, instead of 15. This was, however, performed without removing the device first:
- Modify presentation of LUN 15 and change its ID from 15 to 1.
- Rescan Host B.
- The /var/log/vmkernel log contains an entry similar to:
WARNING: LinSCSI: 4371: The physical media represented by vmhba1:0:1 has changed and the device is in use. The device cannot be re-synchronized with the system. This is a critical error.
- The LUN was in use by one or more components on VMware ESX, so it could not be unmapped and re-mapped in the host's in-memory data structures. The host should be rebooted to re-map all devices and correct this condition. See the Additional Information section for best practices and supported steps for LUN re-presentation on VMware ESX.
- After the presentation changes, above, a rescan task is manually requested on Host B, which has LVM.enableResignature set to 1.
Note: At this stage, Host B was unable to re-synchronize the device correctly, as it was seemingly lost or replaced by a different device prior to its rescan.
Upon discovery of the LUN during rescan, this host determines that the datastore's current presentation (seen as LUN ID 15 still, due to non-synchronization) does not match its written (on-disk) LUN ID of 1 (by Host A), and deems a resignature is required.
Host B resignatures the VMFS datastore as LUN 15, which it perceives to be the current LUN ID.
- By design in VMware vSphere 4.x, this propagates a rescan task to the remainder of the cluster, so that all hosts are updated on the current storage presentation.
- Host A has the datastore correctly presented as LUN ID 1, but discovers the device again with an on-disk LUN ID of 15, indicating that the device needs to be resignatured.
- Host A completes another resignature to write an on-disk ID of 1 again. This prompts the remainder of the cluster to rescan, including Host B.
- Host B rescans its storage, and the process repeats between Host A and Host B until someone intervenes manually.
Stopping the resignature loop
Stop the resignaturing loop condition as soon as possible to prevent interruption in your environment.
To stop the resignaturing loop:
- Log in to the ESX host's terminal, either directly or through SSH.
- Run esxcfg-advcfg -s 0 /LVM/EnableResignature to disable resignaturing.
- Repeat this process for the remainder of the ESX host cluster.
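For a larger cluster, the steps above could be scripted from a management workstation. The following is a minimal sketch, assuming key-based SSH access as root to each host. The host names and both function names are hypothetical; only the esxcfg-advcfg command itself comes from the steps above.

```python
import subprocess

# Advanced setting path taken from the step above.
SETTING = "/LVM/EnableResignature"


def build_disable_command(host, user="root"):
    """Build the ssh invocation that sets the option to 0 on one host."""
    return ["ssh", f"{user}@{host}", f"esxcfg-advcfg -s 0 {SETTING}"]


def disable_resignature(hosts, runner=subprocess.run):
    """Run the disable command on every host and return the hosts that failed.

    `runner` is injectable so the function can be exercised without SSH.
    """
    failed = []
    for host in hosts:
        result = runner(build_disable_command(host), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(host)
    return failed


# Example with hypothetical host names (uncomment to run against real hosts):
# print(disable_resignature(["esx01.example.com", "esx02.example.com"]))
```

Any host returned in the failed list still needs the setting changed by hand, because the loop continues for as long as two or more hosts have resignaturing enabled.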
Closing recommendations
After stopping the resignature thrashing, correct the underlying triggers:
- Verify your SAN LUN presentation and ensure it is uniform across all hosts/initiators in the cluster.
- Verify whether your SAN LUNs support a namespace addressing scheme (such as NAA or EUI) and enable it, if possible.
- Conform to your SAN array vendor's best practices and documentation for LUN presentation or Namespace Addressing changes. If you have any questions or issues, contact your SAN array vendor for assistance.
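The uniformity check in the first recommendation can be sketched as follows. The input is assumed to be collected by hand or from your array's management tools, as a mapping of host name to the LUN ID under which each datastore is presented. The function name and the data layout are hypothetical.

```python
def find_non_uniform_luns(presentation):
    """Return {device: {host: lun_id}} for devices whose LUN ID differs between hosts.

    `presentation` maps host name -> {device label: LUN ID}.
    """
    by_device = {}
    for host, devices in presentation.items():
        for device, lun_id in devices.items():
            by_device.setdefault(device, {})[host] = lun_id
    return {
        device: per_host
        for device, per_host in by_device.items()
        if len(set(per_host.values())) > 1
    }


presentation = {
    "Host A": {"Datastore A": 1, "Datastore B": 2},
    "Host B": {"Datastore A": 15, "Datastore B": 2},
}
print(find_non_uniform_luns(presentation))
# prints {'Datastore A': {'Host A': 1, 'Host B': 15}}
```

This sketch only flags differing LUN IDs. It does not report a device that is missing from a host entirely, which would need a separate check.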