This article describes deploying a vSphere Metro Storage Cluster (vMSC) across two data centers or sites using HP LeftHand Multi-Site storage. With vSphere 5.0, a storage virtualization device can be supported in a Metro Storage Cluster configuration.
vSphere Metro Storage Cluster (vMSC) is a new certified configuration for stretched storage cluster architectures. A vMSC configuration is designed to maintain data availability beyond a single physical or logical site. A storage device is supported in this configuration only after successful vMSC certification. All supported storage devices are listed in the VMware Storage Compatibility Guide.
HP LeftHand storage is a scale-out, clustered, iSCSI storage solution. HP LeftHand Multi-Site is a feature of the LeftHand operating system, commonly known as SAN/iQ software, which is included with all HP LeftHand SANs. This technology allows storage clusters to be stretched across sites to provide high availability beyond failure domains defined by the administrator. Traditionally, in Metro Storage Cluster configurations, these failure domains are distinct geographic locations. However, the technology can also be used to protect against the failure of a logical site, such as a rack, room, or floor in the same building, as well as buildings within a campus or data centers separated by 100 km or more, provided the link satisfies the bandwidth and latency requirements established by VMware and HP.
The HP LeftHand Failover Manager (FOM) is a SAN/iQ component provisioned as a virtual machine that is typically deployed at a third site. In HP LeftHand Multi-Site solutions, the Failover Manager allows for access to the storage volumes to be maintained in the event of a site failure or inter-site link (ISL) failure.
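The tiebreaker role the Failover Manager plays can be illustrated with a minimal majority-quorum sketch. This is a hypothetical model for illustration only, not SAN/iQ code; the voter counts assume an example deployment with two managers per site plus the FOM at a third site.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """A site retains quorum only if it can reach a strict majority of voters."""
    return reachable_voters > total_voters // 2

# Example: two managers per site plus the FOM at a third site = 5 voters.
TOTAL_WITH_FOM = 5

# Site 1 fails: Site 2 still reaches its own 2 managers plus the FOM.
print(has_quorum(3, TOTAL_WITH_FOM))   # True: volumes stay online at Site 2

# Without a FOM (4 voters), an even split leaves neither site with a majority,
# which is why the third-site tiebreaker matters for ISL failures.
print(has_quorum(2, 4))                # False: neither partition has quorum
```

The same arithmetic explains why the FOM should sit at a third site: if it shared a site with half the managers, losing that site would take the tiebreaker down with it.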
The HP LeftHand Multi-Site solution uses SAN/iQ Network RAID technology to stripe two copies of data across a storage cluster. When deployed in a Multi-Site configuration, SAN/iQ ensures that a full copy of the data resides at each site, or on each side of the cluster. In Multi-Site/vMSC configurations, data therefore remains available in the event of a site failure or loss of the link between sites.
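The two-copy placement rule can be sketched as follows. This is a hypothetical illustration of the Network RAID idea described above, not HP's implementation; the node names and round-robin striping are assumptions for the example.

```python
from itertools import cycle

def place_blocks(blocks, site1_nodes, site2_nodes):
    """Write each block twice, forcing the mirror copy onto the other site,
    so each site always holds a complete copy of the data."""
    s1, s2 = cycle(site1_nodes), cycle(site2_nodes)
    return {block: (next(s1), next(s2)) for block in blocks}

layout = place_blocks(range(4), ["S1-N1", "S1-N2"], ["S2-N1", "S2-N2"])

# Losing all of Site 1 still leaves one replica of every block on Site 2,
# which is why volumes stay available through a full site failure.
print(all(mirror.startswith("S2") for _, mirror in layout.values()))  # True
```

The key property is that the two replicas of a block are never placed within the same failure domain, so a whole-site outage removes at most one copy of any block.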
The inter-site link (ISL) is a crucial component of any vMSC solution. The minimum required bandwidth for HP LeftHand Multi-Site is 1 Gbps, and latency should not exceed 2 ms RTT for optimal performance. Larger configurations may require additional ISL bandwidth.
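A simple check against these minimums can be expressed as below. The thresholds come from this article; the helper function and the measured values are illustrative assumptions, and actual measurements should come from the operator's own link tests.

```python
def isl_meets_minimums(bandwidth_gbps: float, rtt_ms: float,
                       min_bandwidth_gbps: float = 1.0,
                       max_rtt_ms: float = 2.0) -> bool:
    """Check a measured ISL against the vMSC minimums quoted above:
    at least 1 Gbps of bandwidth and no more than 2 ms round-trip latency."""
    return bandwidth_gbps >= min_bandwidth_gbps and rtt_ms <= max_rtt_ms

print(isl_meets_minimums(10.0, 1.4))   # True: healthy metro link
print(isl_meets_minimums(1.0, 5.0))    # False: latency exceeds 2 ms RTT
```

Note that meeting the floor is not the same as being sized correctly: as the article states, larger configurations may need more than the 1 Gbps minimum.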
| Scenario | HP LeftHand P4000 Array Behavior | VMware HA Behavior |
| --- | --- | --- |
| Single storage node single-path failure | P4000 node path failover occurs. All volumes remain connected. All ESXi sessions remain active. | No impact observed. |
| ESXi single storage path failure | No impact on volume availability. The ESXi storage path fails over to the alternative path. All sessions remain active. | No impact observed. |
| Site 1 single storage node failure | Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failure fail over to surviving nodes. After the failed node comes back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed node may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume. | No impact observed. |
| Site 2 single storage node failure | Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failure fail over to surviving nodes. After the failed node comes back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed node may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume. | No impact observed. |
| Site 1 all storage nodes failure | Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failures fail over to surviving nodes. After the failed nodes come back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed nodes may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume. | No impact observed. |
| Site 2 all storage nodes failure | Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failures fail over to surviving nodes. After the failed nodes come back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed nodes may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume. | No impact observed. |
| Failover Manager failure | No impact on volume availability. All sessions remain active. | No impact observed. |
| Complete Site 1 failure, including ESXi hosts and storage arrays | Volume availability remains unaffected. Quorum is maintained. iSCSI sessions to surviving ESXi nodes remain active. After the failed nodes come back online, all affected volumes resync automatically. | Virtual machines on the failed ESXi hosts fail. HA restarts the failed virtual machines on ESXi hosts at Site 2. |
| Complete Site 2 failure, including ESXi hosts and storage arrays | Volume availability remains unaffected. Quorum is maintained. iSCSI sessions to surviving ESXi nodes remain active. After the failed nodes come back online, all affected volumes resync automatically. | Virtual machines on the failed ESXi hosts fail. HA restarts the failed virtual machines on ESXi hosts at Site 1. |
| Single ESXi host failure (shutdown) | No impact. The array continues to function normally. | Virtual machines on the failed ESXi host fail. HA restarts the failed virtual machines on surviving ESXi hosts. |
| Multiple ESXi host management network failure | No impact. The array continues to function normally. | No impact. As long as the storage heartbeat is active and virtual machines remain accessible, HA does not initiate failover. |
| Single storage inter-site link failure | No impact. The array continues to function normally. Note: Redundant inter-site links for the storage network are required for this use case. | No impact observed. |
| Site 1 and Site 2 simultaneous failure (shutdown) and restoration | The arrays boot up and resync, all volumes become available, all iSCSI sessions to ESXi hosts are re-established, and virtual machines restart successfully. As a best practice, power on the P4000 arrays first and allow the LUNs to become available before powering on the ESXi hosts. | No impact observed. |
| Management ISL failure | No impact on the P4000 array. Volumes remain available. | If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. Partitioned hosts at the site that does not have a Fault Domain Manager elect a new primary. |
| CMC management server failure | No impact. The array continues to function normally; however, array management functions cannot be performed until the CMC is up and running again. | No impact observed. |
| vCenter Server failure | No impact. The array continues to function normally. | No impact on HA; however, DRS rules cannot be applied. |