Implementing vSphere Metro Storage Cluster using HP LeftHand Multi-Site

Article ID: 319511

Updated On: 02-18-2025

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about deploying a vSphere Metro Storage Cluster (vMSC) across two datacenters or sites using HP LeftHand Multi-Site storage. With vSphere 5.0, a Storage Virtualization Device can be supported in a Metro Storage Cluster configuration.


Resolution

What is vMSC?

vSphere Metro Storage Cluster (vMSC) is a new certified configuration for stretched storage cluster architectures. A vMSC configuration is designed to maintain data availability beyond a single physical or logical site. A storage device is supported in a vMSC configuration only after successful vMSC certification. All supported storage devices are listed on the VMware Storage Compatibility Guide.

What is HP LeftHand Multi-Site?

HP LeftHand storage is a scale-out, clustered iSCSI storage solution. HP LeftHand Multi-Site is a feature of the LeftHand operating system, commonly known as SAN/iQ software, which is included with all HP LeftHand SANs. This technology allows storage clusters to be stretched across sites to provide high availability beyond failure domains defined by the administrator. Traditionally, in Metro Storage Cluster configurations, these failure domains are distinct geographic locations. However, the technology can also be used to protect against the failure of a logical site, such as a rack, room, or floor within the same building, as well as buildings within a campus or data centers separated by 100 km or more, provided the link satisfies the bandwidth and latency requirements established by VMware and HP.

HP LeftHand Failover Manager

The HP LeftHand Failover Manager (FOM) is a SAN/iQ component provisioned as a virtual machine that is typically deployed at a third site. In HP LeftHand Multi-Site solutions, the Failover Manager allows for access to the storage volumes to be maintained in the event of a site failure or inter-site link (ISL) failure.

The HP LeftHand solution employs a distributed, clustered approach to storage and, as such, the SAN/iQ software uses a mechanism called quorum to manage consistency between individual storage nodes. Quorum is controlled by some or all of the nodes in a SAN/iQ storage Management Group; nodes that participate in the quorum process are called Managers. A Management Group can contain more than one storage cluster, and Multi-Site configurations can be deployed with multiple clusters within a Multi-Site Management Group or with a combination of Multi-Site and single-site clusters within a Management Group.

For data to be accessible in an HP LeftHand system, more than half of the system Managers must be online. In Multi-Site cluster configurations where each site contains exactly half of the nodes, a Failover Manager in a third site is used to maintain quorum and preserve data access in cases of site or ISL failure. The FOM does not actively participate in data storage, and the failure or removal of a FOM from an otherwise functioning environment has no impact. The FOM only comes into play when one site or the ISL has failed, or if two system Managers have failed simultaneously.
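
A small worked example may help make the quorum rule concrete. The sketch below is illustrative only (not SAN/iQ code); it simply encodes the majority rule described above, using hypothetical manager counts for a two-site cluster with and without a third-site FOM.

```python
# Minimal sketch (not SAN/iQ code): data stays accessible only while
# MORE than half of the Managers in the Management Group are online.

def has_quorum(online_managers: int, total_managers: int) -> bool:
    """Quorum requires a strict majority of Managers to be reachable."""
    return online_managers > total_managers / 2

# Two-site cluster with 2 Managers per site plus a FOM at a third site
# (5 Managers total). If one entire site fails, the surviving site's
# 2 Managers plus the FOM still form a majority (3 of 5):
assert has_quorum(online_managers=3, total_managers=5)

# Without the FOM (4 Managers total), losing a site leaves exactly half
# online (2 of 4), which is NOT a majority -- data access is suspended:
assert not has_quorum(online_managers=2, total_managers=4)
```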

Configuration Requirements

These requirements must be satisfied to support a vMSC configuration with HP LeftHand (a short configuration-check sketch follows the Notes below):
 
  • ESXi hosts in vMSC configurations should be configured with at least two distinct, isolated IP networks. One of these networks should be dedicated as the storage network; it carries iSCSI traffic between the ESXi hosts and the LeftHand SAN, as well as replication traffic between the storage nodes in the cluster to support Network RAID replication. The second network (the VM network) carries virtual machine traffic as well as management functions for the ESXi hosts. Users may choose to configure additional networks for other functionality such as vMotion; this is recommended as a best practice but is not a strict requirement of a Multi-Site/vMSC configuration. Users may also choose to further separate IP traffic, for example by splitting host management from virtual machine traffic.
  • The maximum round trip latency on the storage network between sites should not exceed 2 milliseconds (ms) RTT.
  • The storage network must support a minimum of 1 Gbps of throughput between the sites. Refer to the HP LeftHand Multi-Site User’s Guide for details on recommended sizing of inter-site links in Multi-Site configurations.
  • Network connectivity between the FOM and the storage nodes should support bandwidth of at least 100 Mbps, and round-trip latency should not exceed 50 ms RTT.
  • The ESXi hosts in both data centers must have a private network on the same IP subnet and broadcast domain.
  • Any IP subnet used by virtual machines must be accessible from ESXi hosts in both datacenters. This is important so that clients accessing virtual machines running on ESXi hosts at either site continue to function smoothly after any VMware HA-triggered virtual machine restart.
  • When one or more storage nodes fail at the back end, the I/O response time must remain under 60 seconds.
  • For vMSC certified configurations, sites should be connected via a redundant storage network consisting of two physical links.
  • The data storage locations, including the boot device used by the virtual machines, must be active and accessible from ESXi hosts in both datacenters.
  • vCenter Server must be able to connect to ESXi hosts in both datacenters.
  • The VMware datastores for the virtual machines running in the ESXi cluster must be provisioned on Network RAID-10 volumes.
  • vMSC configurations with HP LeftHand should use a single-subnet, single-VIP network design.
  • The HA cluster must not exceed 32 hosts.
Notes:
  • An HP LeftHand Failover Manager virtual machine should be configured in a third site and must be able to communicate with the LeftHand storage nodes at both sites of the cluster. To survive the total failure of either site in a two-site Multi-Site configuration, a FOM must be deployed in a third site.
  • vMSC certification testing for HP LeftHand was conducted with SAN/iQ 9.5 and ESXi 5.0.
  • This document describes requirements and supported configurations specifically for HP LeftHand Multi-Site in a vMSC environment. HP may support Multi-Site configurations beyond those outlined in this document.
  • All management, vMotion, and VM networks should be configured per VMware best practices.
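
The threshold values above lend themselves to a simple pre-deployment sanity check. The following is a minimal sketch, not an HP or VMware tool: the function name, parameters, and sample inputs are illustrative, and the measured values would come from your own network testing.

```python
# Hypothetical pre-deployment checker encoding the vMSC thresholds above.

def check_vmsc_requirements(
    storage_rtt_ms: float,    # measured RTT on the inter-site storage network
    storage_bw_gbps: float,   # measured inter-site storage bandwidth
    fom_rtt_ms: float,        # measured RTT between FOM and storage nodes
    fom_bw_mbps: float,       # measured FOM-to-storage-node bandwidth
    ha_cluster_hosts: int,    # number of ESXi hosts in the HA cluster
) -> list[str]:
    """Return a list of violated vMSC requirements (empty list = OK)."""
    problems = []
    if storage_rtt_ms > 2:
        problems.append("storage network RTT exceeds 2 ms")
    if storage_bw_gbps < 1:
        problems.append("inter-site storage bandwidth below 1 Gbps")
    if fom_rtt_ms > 50:
        problems.append("FOM link RTT exceeds 50 ms")
    if fom_bw_mbps < 100:
        problems.append("FOM link bandwidth below 100 Mbps")
    if ha_cluster_hosts > 32:
        problems.append("HA cluster exceeds 32 hosts")
    return problems

# Example with illustrative measurements, all within limits:
print(check_vmsc_requirements(1.5, 10, 20, 1000, 8))  # -> []
```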
 

Solution Overview

The HP LeftHand Multi-Site solution uses SAN/iQ Network RAID technology to stripe two copies of data across a storage cluster. When deployed in a Multi-Site configuration, SAN/iQ ensures that a full copy of the data resides at each site, or on each side of the cluster. In Multi-Site/vMSC configurations, data remains available in the event of a site failure or loss of the link between sites.
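
To make the data layout concrete, here is a conceptual sketch, not SAN/iQ internals: it models the property described above, that Network RAID-10 keeps two copies of every block with one copy per site. The node names and placement function are hypothetical.

```python
# Conceptual sketch only -- NOT SAN/iQ code. Models site-aware mirroring:
# every block has one replica at each site, so either site alone holds a
# complete copy of the volume.

SITE_1_NODES = ["node-1a", "node-1b"]   # hypothetical node names
SITE_2_NODES = ["node-2a", "node-2b"]

def place_block(block_id: int) -> tuple[str, str]:
    """Return the two nodes holding copies of a block, one per site."""
    copy_site1 = SITE_1_NODES[block_id % len(SITE_1_NODES)]
    copy_site2 = SITE_2_NODES[block_id % len(SITE_2_NODES)]
    return (copy_site1, copy_site2)

# Losing an entire site still leaves a full copy of the data:
for block in range(4):
    print(block, place_block(block))
```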

A VMware HA/DRS cluster is created across the two sites using ESXi 5.0 hosts and managed by VMware vCenter Server 5.0. The vSphere management, vMotion, and virtual machine networks are connected using a redundant network between the two sites. It is assumed that the vCenter Server managing the HA/DRS cluster can connect to the ESXi hosts at both sites. For a high-level overview, see this diagram:
 
 

Managing Inter-Site Links

The inter-site link is a crucial component of any vMSC solution. The minimum required bandwidth for HP LeftHand Multi-Site is 1 Gbps, and latency should not exceed 2 ms RTT. Larger configurations may require additional bandwidth for the ISL.

The Multi-Site configuration for vMSC certification employs dual, redundant physical links for the ISL. This configuration provides the highest level of resiliency for Multi-Site configurations. However, in some cases it is possible for both links to fail. If a LeftHand Multi-Site cluster becomes partitioned and the link or links between sites are completely severed, the Failover Manager maintains quorum by ‘siding’ with the nodes at one of the two storage sites. That site maintains data access while access at the second site is suspended until the ISL is restored. The site at which access is maintained in these scenarios can be configured by the administrator via the primary site designation on the Site Configuration page of the LeftHand CMC. In vMSC environments, VMware HA allows for the automatic restart of virtual machines at the surviving site in cases where one site is suspended due to a cluster partition. This behavior applies when a non-redundant ISL link fails or when both links in a dual redundant ISL fail. In configurations where a dual ISL is present (such as in vMSC certified configurations), the loss of a single ISL link has no impact on system operations.
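
The arbitration behavior can be summarized in a few lines. The sketch below is an illustration under the assumptions stated above (on a full partition, the FOM sides with the administrator-designated primary site); the function and site names are hypothetical.

```python
# Sketch of the partition behavior described above. Assumption: when the
# ISL is fully severed, the FOM grants quorum to the configured primary
# site; the other site is suspended until the link is restored.

def surviving_sites(primary_site: str, site_a: str, site_b: str,
                    isl_up: bool) -> set[str]:
    """Return the set of sites with data access after an ISL event."""
    if isl_up:
        return {site_a, site_b}   # no partition: both sites serve I/O
    return {primary_site}         # full partition: primary site keeps quorum

print(surviving_sites("Site-1", "Site-1", "Site-2", isl_up=False))  # {'Site-1'}
```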
 

Sample tested scenarios

 
Scenario: Single storage node, single path failure
HP LeftHand P4000 array behavior: P4000 node path failover occurs. All volumes remain connected. All ESXi sessions remain active.
VMware HA behavior: No impact observed.

Scenario: ESXi single storage path failure
HP LeftHand P4000 array behavior: No impact on volume availability. The ESXi storage path fails over to the alternative path. All sessions remain active.
VMware HA behavior: No impact observed.

Scenario: Site 1 single storage node failure
HP LeftHand P4000 array behavior: Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failure fail over to surviving nodes. After the failed node comes back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed node may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume.
VMware HA behavior: No impact observed.

Scenario: Site 2 single storage node failure
HP LeftHand P4000 array behavior: Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failure fail over to surviving nodes. After the failed node comes back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed node may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume.
VMware HA behavior: No impact observed.

Scenario: Site 1 all storage nodes failure
HP LeftHand P4000 array behavior: Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failures fail over to surviving nodes. After the failed nodes come back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed nodes may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume.
VMware HA behavior: No impact observed.

Scenario: Site 2 all storage nodes failure
HP LeftHand P4000 array behavior: Volume availability remains unaffected. ESXi iSCSI sessions affected by the node failures fail over to surviving nodes. After the failed nodes come back online, all affected volumes resync automatically. Quorum is maintained. Note: Volumes associated with the failed nodes may or may not show as unprotected in the Centralized Management Console, depending on the Data Protection Level configured for the volume.
VMware HA behavior: No impact observed.

Scenario: Failover Manager failure
HP LeftHand P4000 array behavior: No impact on volume availability. All sessions remain active.
VMware HA behavior: No impact observed.

Scenario: Complete Site 1 failure, including ESXi hosts and storage arrays
HP LeftHand P4000 array behavior: Volume availability remains unaffected. Quorum is maintained. iSCSI sessions to surviving ESXi hosts remain active. After the failed site comes back online, all affected volumes resync automatically.
VMware HA behavior: Virtual machines on the failed ESXi hosts fail. HA restarts the failed virtual machines on ESXi hosts at Site 2.

Scenario: Complete Site 2 failure, including ESXi hosts and storage arrays
HP LeftHand P4000 array behavior: Volume availability remains unaffected. Quorum is maintained. iSCSI sessions to surviving ESXi hosts remain active. After the failed site comes back online, all affected volumes resync automatically.
VMware HA behavior: Virtual machines on the failed ESXi hosts fail. HA restarts the failed virtual machines on ESXi hosts at Site 1.

Scenario: Single ESXi host failure (shutdown)
HP LeftHand P4000 array behavior: No impact. The array continues to function normally.
VMware HA behavior: Virtual machines on the failed ESXi host fail. HA restarts the failed virtual machines on surviving ESXi hosts.

Scenario: Multiple ESXi host management network failure
HP LeftHand P4000 array behavior: No impact. The array continues to function normally.
VMware HA behavior: No impact. As long as the storage heartbeat is active and the virtual machines remain accessible, HA does not initiate failover.

Scenario: Single storage inter-site link failure
HP LeftHand P4000 array behavior: No impact. The array continues to function normally. Note: Redundant inter-site links for the storage network are required for this use case.
VMware HA behavior: No impact observed.

Scenario: Site 1 and Site 2 simultaneous failure (shutdown) and restoration
HP LeftHand P4000 array behavior: Arrays boot up and resync. All volumes become available. All iSCSI sessions to ESXi hosts are re-established and virtual machines restart successfully. As a best practice, power on the P4000 arrays first and allow the LUNs to become available before powering on the ESXi hosts.
VMware HA behavior: No impact observed.

Scenario: Management ISL failure
HP LeftHand P4000 array behavior: No impact on the P4000 array. Volumes remain available.
VMware HA behavior: If the HA host isolation response is set to Leave Powered On, virtual machines at each site continue to run because the storage heartbeat is still active. Partitioned hosts at the site that does not have the HA master elect a new master.

Scenario: CMC management server failure
HP LeftHand P4000 array behavior: No impact. The array continues to function normally; however, array management functions cannot be performed until the CMC is up and running again.
VMware HA behavior: No impact observed.

Scenario: vCenter Server failure
HP LeftHand P4000 array behavior: No impact. The array continues to function normally.
VMware HA behavior: No impact on HA. However, DRS rules cannot be applied.