Fibre Channel switch outage leaves pending SCSI reservations on LUNs managed by an IBM SVC

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
These symptoms may be observed as the result of a fibre channel outage on LUNs managed by an IBM SVC:

After a Fibre Channel outage, log messages may show the paths for an HBA marked as offline:

vmkernel: 2:17:25:49.804 cpu7:4340)<3> rport-5:0-0: blocked FC remote port time out: saving binding
vmkernel: 2:17:25:49.804 cpu7:4340)<3> rport-5:0-1: blocked FC remote port time out: saving binding
When a failover occurs, the host has no awareness that a prior reservation existed on the device. Therefore, no LUN reset is issued to clear the pending reservation. You may see log entries similar to these when this occurs:

cpu19:4150)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.60050768018080cdb8000000000004c5": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
cpu1:5750)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T4:L1 to vmhba2:C0:T3:L1 for device "naa.60050768018080cdb8000000000004c5".
The reservation conflict occurs after the failover completes successfully. The VMkernel log shows messages from the IBM SVC SATP similar to:

cpu12:4286)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path "vmhba2:C0:T5:L1" state. Status=SCSI reservation conflict
The host may also be unable to determine the status of the reservation because the RESERVE command did not complete. In that case, you see log entries similar to:

cpu1:4620)NMP: nmp_CompleteCommandForPath: Command 0x16 (0x4102bf7b0840) to NMP device "naa.60050768018080cdb8000000000004c5" failed on physical path "vmhba1:C0:T4:L1" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
cpu1:4620)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T4:L1, reservation state on device naa.60050768018080cdb8000000000004c5 is unknown.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
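
As a rough aid for checking a host against the symptoms above, the following sketch counts how often each of the quoted message fragments appears in a log file. The function name scan_reservation_symptoms and the default path /var/log/vmkernel.log are assumptions made for illustration; the VMkernel log location differs between ESX/ESXi releases, so pass the path that applies to your host.

```shell
#!/bin/sh
# Sketch only: count the log signatures quoted in this article.
# scan_reservation_symptoms is a hypothetical helper, not a VMware tool.
# The default log path is an assumption; pass your own path as $1.
scan_reservation_symptoms() {
    log="${1:-/var/log/vmkernel.log}"
    for pattern in \
        "blocked FC remote port time out" \
        "Status=SCSI reservation conflict" \
        "SCSI cmd RESERVE failed on path"
    do
        # grep -c prints the number of matching lines (0 when none match);
        # "|| true" keeps a zero-match result from aborting under "set -e".
        count=$(grep -c -- "$pattern" "$log" || true)
        printf '%s: %s\n' "$pattern" "$count"
    done
}
```

For example, running scan_reservation_symptoms /path/to/your/log prints one line per signature with its match count. Non-zero counts only indicate that similar messages were logged; they do not by themselves confirm a stuck reservation.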

Environment

VMware ESXi 4.1.x Embedded
VMware ESX 4.0.x
VMware ESX 4.1.x
VMware vSphere ESXi 5.0
VMware ESXi 4.0.x Installable
VMware ESXi 3.5.x Installable
VMware ESXi 4.0.x Embedded
VMware ESXi 4.1.x Installable
VMware ESXi 3.5.x Embedded
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5
VMware ESX Server 3.5.x

Cause

The reservation conflict occurs when SCSI reservations are left on the LUNs for the HBA that is no longer logged into the fabric.

This happens when the Fibre Channel link goes offline at the moment the SCSI reserve commands are sent to the array. The SCSI reserve command reaches the array and places the reservation, but the acknowledgment of the command never makes it back to the host because the link is now offline. Because the host is unaware that the reservation was placed, no LUN reset is issued on path failover.

This behavior may occur when:

IBM SVC is running firmware prior to 5.1, or
A hard crash of the Fibre Channel switch occurs in conjunction with IBM SVC (any firmware revision)

Resolution

To resolve this issue:

Update the IBM SVC firmware to version 7.1 or later per IBM recommendations. Starting with SVC firmware 5.1, code was introduced to clear pending SCSI-2 reservations when an initiator logs off the fabric. The latest IBM recommendation includes this change.
Enable the additional ESXi multipathing option reset_on_attempted_reserve. This option prevents a stuck reservation after a hard crash of the Fibre Channel switch: ESXi tracks attempted SCSI-2 reservations, and if a path failover is triggered while a reservation is on the path, NMP sends a LUN reset to clear it before attempting the failover.

Note: This option is necessary to perform a LUN reset on the LUNs reporting a SCSI reservation conflict when running IBM SVC with any firmware version that depends on receiving an RSCN (Registered State Change Notification). The RSCN indicates that devices have logged off the fabric. The LUN reset is needed because RSCNs are not guaranteed to be sent after a hard crash.

To enable the reset_on_attempted_reserve option on the Claim Rule for your IBM Storage Array, issue these esxcli commands:

For ESXi 5.5/5.1/5.0:
1. esxcli storage nmp satp rule remove --satp VMW_SATP_ALUA --vendor IBM --model 2145
2. esxcli storage nmp satp rule add --satp VMW_SATP_ALUA --psp VMW_PSP_RR --vendor IBM --model 2145 --option reset_on_attempted_reserve
For ESXi 5.0/5.1 with an MSCS configuration:
1. esxcli storage nmp satp rule remove --satp VMW_SATP_SVC --vendor IBM --model 2145
2. esxcli storage nmp satp rule add --satp VMW_SATP_SVC --psp VMW_PSP_FIXED --vendor IBM --model 2145 --option reset_on_attempted_reserve

Note: The esxcli commands for ESX/ESXi 4.1 are slightly different:

Delete the existing rule:

esxcli nmp satp deleterule --vendor IBM --model 2145 --satp VMW_SATP_SVC
Then add the new one:

esxcli nmp satp addrule --satp VMW_SATP_ALUA --psp VMW_PSP_RR --vendor IBM --model 2145 --option reset_on_attempted_reserve
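
After replacing the rule, it can be useful to confirm that the new rule carries the option. The sketch below filters claim rule listing text read from standard input and succeeds only if a line for vendor IBM, model 2145 includes reset_on_attempted_reserve. The helper name rule_has_reset_option is hypothetical, and the check assumes the listing prints vendor, model, and options on a single line; adjust it if the output on your release is laid out differently.

```shell
#!/bin/sh
# Sketch only: rule_has_reset_option is a hypothetical helper.
# It reads a SATP claim rule listing on stdin and returns 0 when a line
# mentioning IBM and 2145 also contains reset_on_attempted_reserve.
# Assumption: each rule is printed on one line.
rule_has_reset_option() {
    grep 'IBM' | grep '2145' | grep -q 'reset_on_attempted_reserve'
}

# Example use on an ESXi 5.x host (not executed here):
#   esxcli storage nmp satp rule list | rule_has_reset_option \
#       && echo "option present" || echo "option missing"
```

On ESX/ESXi 4.1 the listing command is named differently, in line with the other 4.1 commands shown above, so feed the helper the output of the equivalent command for that release.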

Additional Information

For more information about performing a LUN reset from an ESXi/ESX host, see Resolving SCSI reservation conflicts (1002293).

For more information about IBM Storage Array Firmware and recommended settings, contact your storage array vendor.