One host shows a Storage Initiator Error while all other hosts show SCSI Reservation Conflicts

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

The logs from a single ESXi/ESX host reveal that the HBA driver is aborting commands that did not complete:

Apr 29 21:07:31 esx69 vmkernel: 150:13:10:17.572 cpu5:1519)<4>lpfc1:0712:FPe:SCSI layer issued abort device Data: x0 x14 Apr 29 21:07:33 esx69 vmkernel: 150:13:10:18.844 cpu5:1519)<4>lpfc1:0749:FPe:Completed Abort Task Set Data: x0 x14 x128
You see constant Storage Initiator Error messages:

Apr 29 21:07:33 esx69 vmkernel: 150:13:10:18.844 cpu5:1029)StorageMonitor: 196: vmhba0:0:20:0 status = 0/7 0x0 0x0 0x0 ... Apr 29 21:13:21 esx69 vmkernel: 150:13:16:07.099 cpu15:1055)WARNING: SCSI: 2933: CheckUnitReady on vmhba1:0:18 returned Storage initiator error 0x7/0x0 sk 0x0 asc 0x0 ascq 0x0
On ESXi 5.1, you may also see these error messages:

2012-12-08T17:36:07.197Z cpu18:8229)WARNING: HBX: 2829: Reclaiming timed out [HB state abcdef02 offset 3588096 gen 17 stampUS 21555940324 uuid 50c32720-3951a3de-9237-001e6727086e jrnl <FB 266000> drv 14.58] on vol 'XXXXX' failed: No connection
These lead to reservation requests timing out, among other errors:

Apr 29 22:07:57 esx69 vmkernel: 150:14:10:43.028 cpu10:1051)WARNING: FS3: 4787: Reservation error: Timeout .. Apr 29 22:09:46 esx69 vmkernel: 150:14:12:32.817 cpu5:1048)WARNING: ScsiDevice: 3264: Failed for vml.02001300006006048000019010336153303030344353594d4d4554: Timeout Apr 29 22:09:46 esx69 vmkernel: 150:14:12:32.817 cpu5:1048)WARNING: ScsiDevice: 3633: READ CAPACITY on device "vml.02001300006006048000019010336153303030344353594d4d4554" from Plugin "legacyMP" failed. Timeout Apr 29 22:09:46 esx69 vmkernel: 150:14:12:32.817 cpu5:1048)LVM: 2375: Could not open device vml.02001300006006048000019010336153303030344353594d4d4554:1, vol [49381c21-637d4334-6333-001cc4efceaa, 49381c21-637d4334-6333-001cc4efceaa, 1]: Failure
It is common to see the kernel spawn helper worlds to assist with long running operations:

Apr 29 22:09:54 esx69 vmkernel: 150:14:12:40.113 cpu0:1024)WARNING: Helper: 1403: cancel request handle=1 fn=0x68ea3c Apr 29 22:09:54 esx69 vmkernel: 150:14:12:40.113 cpu0:1024)Helper: 826: Dumping non-active requests on HELPER_MISC_QUEUE at 38167427784153540 Apr 29 22:09:54 esx69 vmkernel: 150:14:12:40.113 cpu0:1024)Helper: 826: Dumping active requests on HELPER_MISC_QUEUE at 38167427784212441 Apr 29 22:09:54 esx69 vmkernel: 150:14:12:40.113 cpu0:1024)Helper: 833: 0: status=2 func=0x68ea3c since=38167046941113441 Apr 29 22:09:54 esx69 vmkernel: 150:14:12:40.114 cpu0:1024)VMNIX: VmkCall: 351: Waiting for long-running helper request
These messages continue until the host is rebooted:

Apr 29 23:10:20 esx69 vmkernel: 150:15:13:06.683 cpu0:1024)VMNIX: VMFS: 118: timed out Apr 29 23:10:38 esx69 vmkernel: 150:15:13:24.532 cpu15:1046)VMKAcpi: 1878: In PowerButton Helper Apr 29 23:14:25 esx69 vmkernel: TSC: 0 cpu0:0)Init: 386: cpu 0: early measured tsc speed 2933437932 Hz
All other hosts report SCSI reservation conflicts at the moment this single host started aborting I/O and reporting a Storage Initiator Error:

Apr 29 21:07:55 esx68 vmkernel: 150:07:58:19.858 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 64 Apr 29 21:07:56 esx68 vmkernel: 150:07:58:20.819 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 48 Apr 29 21:07:56 esx68 vmkernel: 150:07:58:21.085 cpu10:1379)VSCSIFs: 441: fd 193155182 status Busy Apr 29 21:07:57 esx68 vmkernel: 150:07:58:21.899 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 32 Apr 29 21:07:58 esx68 vmkernel: 150:07:58:22.923 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 16 Apr 29 21:07:59 esx68 vmkernel: 150:07:58:23.874 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 0 Apr 29 21:07:59 esx68 vmkernel: 150:07:58:23.874 cpu5:1051)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts Apr 29 21:07:59 esx68 vmkernel: 150:07:58:23.874 cpu5:1051)WARNING: FS3: 4787: Reservation error: SCSI reservation conflict Apr 29 21:07:59 esx68 vmkernel: 150:07:58:23.874 cpu5:1051)WARNING: FS3: 4981: Reclaiming timed out heartbeat [HB state abcdef02 offset 3938816 gen 13037 stamp 12988698946510 uuid 4b142703-2c3bc54e-ad08-0025b3e16d38 jrnl <FB 100052> drv 4.31] failed: SCSI reservation conflict
The SCSI reservation conflict messages on all the other ESXi/ESX hosts stop at the moment the host is rebooted:

Apr 29 23:10:43 esx68 vmkernel: 150:10:01:07.626 cpu5:1051)SCSI: vm 1051: 109: Sync CR at 16 Apr 29 23:10:43 esx68 vmkernel: 150:10:01:08.290 cpu5:1051)FS3: 4831: Reclaimed timed out heartbeat [HB state abcdef02 offset 3938816 gen 13037 stamp 12988683869135 uuid 4b142703-2c3bc54e-ad08-0025b3e16d38 jrnl <FB 100052> drv 4.31]

Environment

VMware ESXi 4.1.x Embedded
VMware vSphere ESXi 5.0
VMware ESXi 4.0.x Installable
VMware ESXi 3.5.x Installable
VMware ESX 4.1.x
VMware ESXi 4.0.x Embedded
VMware ESXi 3.5.x Embedded
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5
VMware ESX Server 3.5.x
VMware ESXi 4.1.x Installable
VMware ESX 4.0.x

Resolution

In this scenario, the HBA goes into an internal fatal error state.

This is the sequence of events:

ESX host X's HBA sends a SCSI reserve command to the array to reserve LUN Y.
Array places reservation on LUN Y for the initiator.
ESX host X's HBA goes into internal fatal error state and no longer responds to/issues commands, particularly a SCSI release.
Driver starts to abort commands in flight since they did not complete before the timeout period.
ESX SCSI mid layer reports a Storage Initiator Error when trying to communicate with the HBA.
No hosts can complete I/O requests to LUN Y since it has a reservation on it for the initiator in ESX host X.
All hosts return a SCSI Reservation Conflict status from the array when attempting to issue I/O to LUN Y.
ESX host X is shut down/rebooted.
The reboot causes the HBAs in that host to drop from the fabric.
The array sees the HBAs drop off the fabric and proceeds to clear any pending reservations from those initiators.
The other ESX hosts are now able to do I/O to LUN Y and are no longer getting SCSI Reservation Conflict statuses.

From the logs, you can see that all of the errors are reported from one HBA, which is the second HBA. If this issue happens repeatedly for the same HBA/initiator, it is possible that the HBA is faulty and should be replaced. This behavior may also occur if the array controller is overloaded, if FCP frames are getting dropped, or if the ESX hosts are blades connecting to a blade switch with outdated firmware.

As a temporary workaround, you can attempt to issue a LUN reset to the LUN in question to see if this releases the SCSI reserve on the LUN:

vmkfstools -L lunreset /vmfs/devices/disks/device_ID

Additional Information

1 つのホストがストレージイニシエータエラーを示し、他のすべてのホストが SCSI 予約の競合を示す
一台主机显示存储启动器错误，而所有其他主机显示 SCSI 预留冲突