MSCS shared RDMs temporarily offline upon SAN storage controller failover.

Article ID: 338526

Updated On:

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:
In a Windows Cluster Across Boxes configuration, a backend storage failover event causes the shared RDM LUNs to be temporarily reported offline, disrupting the Windows cluster.

The issue occurs whenever an MSCS VM has been vMotioned at any point and a storage controller reboot is subsequently performed.



The Windows quorum disk enters a failed state after the failover is initiated, and the quorum manager reports a failure to update the cluster configuration:

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: MM/DD/YYYY HH:MM:SS
Event ID: 1557
Task Category: Quorum Manager
Level: Error
Keywords:
User: SYSTEM
Computer: ########
Description:
Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{########-####-####-####-############}" />
    <EventID>1557</EventID>
    <Version>0</Version>
    <Level>2</Level>
    <Task>42</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000000</Keywords>
    <TimeCreated SystemTime="YYYY-MM-DDTHH:MM:SS.Z" />
    <EventRecordID>189031</EventRecordID>
    <Correlation />
    <Execution ProcessID="1###" ThreadID="8###" />
    <Channel>System</Channel>
    <Computer>########</Computer>
    <Security UserID="S-1-#-##" />
  </System>
  <EventData>
    <Data Name="NodeName">#######</Data>
  </EventData>
</Event>

At the same time, the cluster log reports the failure and the witness disk is detached:

0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [DM] Node 1: DetachWitness
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [QUORUM] Node 1: Witness detached due to update failure
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS ERR [DM] Error while restoring (refreshing) the hive: (c000017d), registry: \Registry\Machine\0.Cluster
0000065c.00001758::YYYY/MM/DD-HH:MM:SS INFO [QUORUM] Node 1: Witness Failed Gum Handler [QUORUM] Node 1
0000065c.00001758::YYYY/MM/DD-HH:MM:SS INFO [QUORUM] Node 1: witness attach failed. next restart will happen at YYYY/MM/DD-HH:MM:SS
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS ERR [QUORUM] Node 1: Failing quorum resource due to witness failure
0000065c.00001758::YYYY/MM/DD-HH:MM:SS INFO [GUM] Node 1: executing request locally, gumId:74547, my action: qm/witness-failed, # of updates: 1
0000065c.00001758::YYYY/MM/DD-HH:MM:SS INFO [QUORUM] Node 1: Witness Failed Gum Handler [QUORUM] Node 1
0000065c.00001758::YYYY/MM/DD-HH:MM:SS INFO [QUORUM] Node 1: witness attach failed. next restart will happen at YYYY/MM/DD-HH:MM:SS
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Cluster Disk 1', gen(2) result 0/0.
[...]
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [RCM] Res Cluster Disk 1: Online -> ProcessingFailure( StateUnknown )
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [RCM] TransitionToState(Cluster Disk 1) Online-->ProcessingFailure.
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS INFO [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Cluster Group, Online --> Pending)
0000065c.000022b0::YYYY/MM/DD-HH:MM:SS ERR [RCM] rcm::RcmResource::


The ESXi host first logs the failover by reporting the RSCNs triggered when the storage controller's target port groups (TPGs) were disabled.
For example, RSCNs logged by the lpfc driver:

YYYY-MM-DDTHH:MM:SS.Z cpu##:33906)lpfc: lpfc_els_rcv_rscn:5957: 1:(0):5973 RSCN received event x0 : Address format x00 : DID x010600
[...]
YYYY-MM-DDTHH:MM:SS.Z cpu##:33903)lpfc: lpfc_els_rcv_rscn:5957: 0:(0):5973 RSCN received event x0 : Address format x00 : DID x010600
[...]

Once the failover begins, the host logs the start of the 10-second 'devloss' timer, as expected:

YYYY-MM-DDTHH:MM:SS.Z cpu##:33903)lpfc: lpfc_start_devloss:4156: 0:(0):3248 Start 10 sec devloss tmo WWPN ##:##:##:##:##:##:##:## NPort x010600
[...]
YYYY-MM-DDTHH:MM:SS.Z cpu##:33903)lpfc: lpfc_start_devloss:4156: 0:(0):3248 Start 10 sec devloss tmo WWPN ##:##:##:##:##:##:##:## NPort x020600

After the 'devloss' timer expires, the WWPNs are removed from available use:
[...]
YYYY-MM-DDTHH:MM:SS.Z cpu##:33906)WARNING: lpfc: lpfc_dev_loss_tmo_handler:344: 1:(0):0203 Devloss timeout on WWPN ##:##:##:##:##:##:##:## NPort x010600 Data: x8 x8 x5
YYYY-MM-DDTHH:MM:SS.Z cpu##:33906)WARNING: lpfc: lpfc_dev_loss_tmo_handler:385: 1:(0):3298 ScsiNotifyPathStateChangeAsyncSAdapter Num x1 TID x2, DID x010600.
[...]



The Quorum LUN reports a "FAILOVER" action and then recovers:

[...]
YYYY-MM-DDTHH:MM:SS.Z cpu##:33651)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43ba80e8ca80, 37714) to dev "naa.624############" on path "vmhba1:C0:T1:L1##" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:FAILOVER
YYYY-MM-DDTHH:MM:SS.Z cpu##:33651)WARNING: NMP: nmp_DeviceRetryCommand:133: Device "naa.624############": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
YYYY-MM-DDTHH:MM:SS.Z cpu##:33651)WARNING: NMP: nmp_DeviceStartLoop:725: NMP Device "naa.624############" is blocked. Not starting I/O from device.
[...]
YYYY-MM-DDTHH:MM:SS.Z cpu##:33956)WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.624############" - issuing command 0x43ba80e8ca80
YYYY-MM-DDTHH:MM:SS.Z cpu##:34039)NMP: nmpCompleteRetryForPath:325: Retry world recovered device "naa.624############"
[...]




Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5

Resolution

The fix is to clear the reservation state from the memory of the source ESXi host when the VM is vMotioned.
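For reference, stale SCSI reservation state on an ESXi device can also be released manually with `vmkfstools -L`. This is a hedged sketch rather than the fix described above: the device path is a placeholder, and a LUN reset against a disk still serving a live MSCS cluster can itself disrupt the cluster, so only run it under support guidance.

```shell
# Placeholder device path - substitute the affected RDM's naa. identifier.
# A LUN reset releases any SCSI reservation held on the device; do not run
# this against a disk in active use by the cluster without support guidance.
vmkfstools -L lunreset /vmfs/devices/disks/naa.624############
```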

Workaround:
Avoid vMotion of MSCS VMs.