Windows event log reports a critical error when running Microsoft Windows Failover Clustering on VMware ESXi Cluster Across Box

Products

VMware vSphere ESXi

Issue/Introduction

Symptoms:

When a Microsoft Windows Failover Clustering (WFC) instance runs in a VMware ESXi Cluster Across Box (CAB) configuration with shared physical-mode Raw Device Mappings (RDMs), the Windows event log records critical errors in the System log during storage faults.

A SAN storage controller fault or a redundant target port failure might trigger an unexpected failover of WFC resources.
The System event logs on Windows 2008, Windows 2012, or Windows 2012 R2 show critical, error, or warning messages similar to:
- Windows 2008:
  
  Event ID: 1135
  Cluster node 'node name' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
  
  Event ID: 1069
  Cluster resource 'Cluster Disk #' in clustered service or application 'Cluster Group' failed.
  
  Event ID: 1177
  The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
  Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
  
  Event ID: 7024
  The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster.
  
  Event ID: 7031
  The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
- Windows 2012 or Windows 2012 R2:
  
  Event ID: 140
  The system failed to flush data to the transaction log. Corruption may occur in VolumeId: X:, DeviceName: \Device\HarddiskVolume#.
  ({Device Busy}
  The device is currently busy.)
  
  Event ID: 1038
  Ownership of cluster disk 'Cluster Disk #' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
  
  Event ID: 1069
  Cluster resource 'Cluster Disk #' of type 'Physical Disk' in clustered role 'X:' failed. The error code was '0xaa' ('The requested resource is in use.').
  
  Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
In the /var/log/vmkernel.log file on the ESXi host, you see warning messages similar to:

WARNING: NMP: nmpUpdatePReservationOnFailover:1264: Unable to check for matching key on failover for device "naa.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

WARNING: NMP: nmp_DeviceUpdatePathStates:886: Could not drop reservation on failover for NMP device "naa.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX".

WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" - issuing command 0x439dc97ccd80
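The warnings above can be searched for in a copy of vmkernel.log with a short script. The following is a minimal sketch, not a VMware tool: the function name find_nmp_failover_warnings is hypothetical, and the pattern assumes that real log lines follow the same layout as the three examples shown here (the number after the function name is not matched because it may differ between builds).

```python
import re

# NMP function names taken from the example warnings in this article.
NMP_FUNCTIONS = (
    "nmpUpdatePReservationOnFailover",
    "nmp_DeviceUpdatePathStates",
    "nmpDeviceAttemptFailover",
)

# Assumption: each warning names the affected device in double quotes,
# as in the examples above.
PATTERN = re.compile(
    r"WARNING: NMP: (?P<func>" + "|".join(NMP_FUNCTIONS) + r"):\d+:.*?"
    r"\"(?P<device>naa\.[0-9A-Za-z]+)\""
)


def find_nmp_failover_warnings(lines):
    """Return (function, device) pairs for lines that match the pattern."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group("func"), match.group("device")))
    return hits
```

To use it, pass any iterable of log lines, for example an open file handle on a copy of /var/log/vmkernel.log. A match only shows that similar warnings were logged; it does not by itself confirm this issue.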

Note: This unexpected resource failover may affect the application or I/O running on the failover cluster.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Environment

VMware vSphere ESXi 5.0
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5
VMware vSphere ESXi 6.0

Resolution

This issue is resolved in:

VMware ESXi 6.0 Update 1b, available at VMware Downloads.
VMware ESXi 5.5 patch ESXi550-201512001. For more information, see VMware ESXi 5.5, Patch Release ESXi550-201512001 (2135410).

Additional Information


For more information about Microsoft Clustering solutions running on VMware vSphere, see:

Microsoft Cluster Service (MSCS) support on ESXi/ESX
Microsoft Clustering on VMware vSphere: Guidelines for supported configurations
Microsoft Windows Server Failover Clustering (WSFC) with shared disks on VMware vSphere 6.x: Guidelines for supported configurations
Setup for Failover Clustering and Microsoft Cluster Service:
- vSphere 5.5
- vSphere 6.0

To view the Windows Event logs:

Windows 2012:

1. In the right pane of the Server Manager window, click Tools and select Event Viewer from the menu.
2. In the left pane of the Event Viewer window, go to Event Viewer (Local) > Windows Logs > System.

Windows 2008:

In the left pane of the Server Manager window, go to Server Manager > Diagnostics > Event Viewer > Windows Logs > System.
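
Once the event IDs from the System log have been noted or exported, they can be compared against the IDs listed in the Symptoms section. The sketch below is illustrative only: the names SYMPTOM_EVENT_IDS and match_symptom_events are hypothetical, and the ID sets are copied from this article. These event IDs can also be logged for unrelated reasons, so a match is a hint to check the ESXi host logs, not a diagnosis.

```python
# Event IDs as listed in the Symptoms section of this article.
SYMPTOM_EVENT_IDS = {
    "Windows 2008": {1135, 1069, 1177, 7024, 7031},
    "Windows 2012 / 2012 R2": {140, 1038, 1069},
}


def match_symptom_events(observed_ids):
    """Return, per Windows version, the observed IDs that appear in Symptoms."""
    observed = set(observed_ids)
    return {
        os_name: sorted(ids & observed)
        for os_name, ids in SYMPTOM_EVENT_IDS.items()
        if ids & observed
    }
```

For example, calling match_symptom_events([1069, 140, 999]) reports 1069 under both Windows versions and 140 under Windows 2012 / 2012 R2, and ignores 999.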

LUN filtering mechanism during RDM creation
Guidelines for Microsoft Clustering on vSphere

Impact/Risks:
This unexpected resource failover may affect the application or I/O running on the failover cluster. If this issue occurs, you must restart your application.