Symptoms:
When you run a Microsoft Windows Failover Clustering (WFC) instance in a VMware ESXi Cluster Across Boxes (CAB) configuration with shared physical-mode Raw Device Mappings (RDMs), Windows records critical errors in the System event log during storage faults.
- A SAN storage controller fault or a redundant target port failure might trigger an unexpected failover of WFC resources
- The Windows 2008, Windows 2012, or Windows 2012 R2 System event log shows critical, error, or warning messages similar to:
- Windows 2008:
Event ID: 1135
Cluster node 'node name' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Event ID: 1069
Cluster resource 'Cluster Disk #' in clustered service or application 'Cluster Group' failed.
Event ID: 1177
The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.
Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
Event ID: 7024
The Cluster Service service terminated with service-specific error A quorum of cluster nodes was not present to form a cluster.
Event ID: 7031
The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
- Windows 2012 or Windows 2012 R2:
Event ID: 140
The system failed to flush data to the transaction log. Corruption may occur in VolumeId: X:, DeviceName: \Device\HarddiskVolume#.
({Device Busy}
The device is currently busy.)
Event ID: 1038
Ownership of cluster disk 'Cluster Disk #' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.
Event ID: 1069
Cluster resource 'Cluster Disk #' of type 'Physical Disk' in clustered role 'X:' failed. The error code was '0xaa' ('The requested resource is in use.').
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.
- In the /var/log/vmkernel.log file on the ESXi host, you see warning messages similar to:
WARNING: NMP: nmpUpdatePReservationOnFailover:1264: Unable to check for matching key on failover for device "naa.600a098044306879702b454e48496232"
WARNING: NMP: nmp_DeviceUpdatePathStates:886: Could not drop reservation on failover for NMP device "naa.600a098044306879702b454e48496232".
WARNING: NMP: nmpDeviceAttemptFailover:603: Retry world failover device "naa.600a098044306879702b454e48496232" - issuing command 0x439dc97ccd80
Note: This unexpected resource failover may affect the application or I/O running on the failover cluster.
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
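To check whether a host logged the same NMP warnings, the vmkernel.log excerpts above can be matched with a short script. This is a minimal sketch, not a VMware-provided tool: the function name find_nmp_failover_warnings and the regular expression are illustrative assumptions, built only from the three example lines shown above, so other ESXi builds may word these messages differently.

```python
import re

# Function names come from the vmkernel.log excerpts in this article.
# The pattern is an assumption derived from those three example lines only.
NMP_WARNING = re.compile(
    r'WARNING: NMP: (?P<func>nmpUpdatePReservationOnFailover|'
    r'nmp_DeviceUpdatePathStates|nmpDeviceAttemptFailover):\d+: '
    r'.*?device "(?P<device>naa\.[0-9a-f]+)"'
)


def find_nmp_failover_warnings(lines):
    """Group matching warnings by device.

    Returns a dict mapping each NAA device ID to the list of NMP function
    names that logged a failover warning for it, in order of appearance.
    """
    hits = {}
    for line in lines:
        match = NMP_WARNING.search(line)
        if match:
            hits.setdefault(match.group("device"), []).append(match.group("func"))
    return hits
```

The function accepts any iterable of lines, so it can be given an open file handle for a copy of /var/log/vmkernel.log. A device that appears with all three function names shows the same pattern as the excerpts above.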