vSphere Stretch Cluster Solutions with IBM SAN Volume Controller HyperSwap

Article ID: 341600


Products

VMware vSphere ESXi

Issue/Introduction

The IBM System Storage SAN Volume Controller is an enterprise-class storage virtualization system that enables a single point of control for aggregated storage resources. SAN Volume Controller consolidates capacity from different storage systems, both IBM and non-IBM branded, while enabling common copy functions, non-disruptive data movement, and improved performance and availability. IBM SAN Volume Controller combines hardware and software into an integrated, modular solution that forms a highly scalable cluster. An IBM SAN Volume Controller cluster is commonly installed to provide service in a single data center. An IBM SAN Volume Controller cluster can also be installed in a “HyperSwap” configuration, where a single SAN Volume Controller cluster provides service to two data centers at a maximum distance of 300 km. In a HyperSwap configuration, IBM SAN Volume Controller enables a highly available HyperSwap volume to be accessed concurrently at both data center locations.

IBM Spectrum Virtualize software version 7.5 introduced the HyperSwap technology, which provides a high availability (HA) solution.

For VMware environments, HyperSwap supports multiple vSphere stretched storage cluster solutions, including: 
  • Highly available active-active vSphere datastores 
  • Workload mobility 
  • Cross-site automated load balancing 
  • Enhanced downtime avoidance 
  • Disaster avoidance
Some of the solutions listed involve VMware Site Recovery Manager (SRM). The focus of this document is on solutions that rely on vMSC configurations and do not require SRM.

VMware vSphere Metro Storage Cluster (vMSC) is a specific configuration within the VMware Hardware Compatibility List (HCL). These configurations are commonly referred to as stretched storage clusters or metro storage clusters and are implemented in environments where disaster and downtime avoidance is a key requirement.

Environment

VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 6.7
VMware vSphere ESXi 7.x

Resolution

Configuration

HyperSwap solution components
A minimal HyperSwap solution consists of:
  • One system that consists of at least two I/O groups. Each I/O group is at a different site, and both nodes of an I/O group are at the same site.
  • HyperSwap-protected hosts, each connected to the storage nodes via iSCSI, Fibre Channel, or FCoE
  • In addition to the two sites that are defined as failure domains, a third site to house a quorum disk or IP quorum application
Node

A node is an individual server upon which the IBM SAN Volume Controller cluster runs. The nodes are always installed in pairs. Each pair of nodes is known as an I/O group. I/O operations between hosts and system nodes and between the nodes and arrays use the SCSI standard. The nodes communicate with each other through private SCSI commands.

I/O group
A pair of nodes is known as an input/output (I/O) group. 

Volumes are logical disks that are presented to the SAN by nodes. Volumes are also associated with the I/O group. When an application server processes I/O to a volume, it can access the volume with either of the nodes in the I/O group. When you create a volume, you can specify a preferred node. The other node in the I/O group is used only if the preferred node is not accessible. If you do not specify a preferred node for a volume, the system selects the node in the I/O group that has the fewest volumes to be the preferred node.
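The preferred-node behavior described above can be illustrated with a short sketch. The following is plain Python for illustration only; the Node and IOGroup classes and the vdisk names are hypothetical, not SVC CLI objects. If no preferred node is specified, the node with the fewest volumes is chosen, and the partner node is used only when the preferred node is inaccessible.

    # Illustrative sketch only; classes and names are hypothetical, not SVC objects.
    class Node:
        def __init__(self, name):
            self.name = name
            self.volumes = []          # volumes for which this node is preferred
            self.accessible = True

    class IOGroup:
        def __init__(self, node_a, node_b):
            self.nodes = (node_a, node_b)

        def create_volume(self, volume, preferred=None):
            # If no preferred node is specified, pick the node with the fewest volumes.
            node = preferred or min(self.nodes, key=lambda n: len(n.volumes))
            node.volumes.append(volume)
            return node

        def serving_node(self, volume):
            # The preferred node serves I/O; its partner is used only if it is inaccessible.
            preferred = next(n for n in self.nodes if volume in n.volumes)
            if preferred.accessible:
                return preferred
            return self.nodes[1] if preferred is self.nodes[0] else self.nodes[0]

    iogrp = IOGroup(Node("node1"), Node("node2"))
    owner = iogrp.create_volume("vdisk0")   # node with the fewest volumes becomes preferred
    print(owner.name, iogrp.serving_node("vdisk0").name)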

The HyperSwap topology locates both nodes of an I/O Group in the same site. You should assign the nodes of at least one I/O group to each of sites 1 and 2. Therefore, to get a volume that is resiliently stored on both sites, at least two I/O Groups (4 nodes) are required. For larger systems using the HyperSwap system topology, if most volumes are configured as HyperSwap volumes, it’s preferable to have 2 full I/O groups on each site, instead of 1 I/O group on one site and 2 on the other. This is to avoid the site with only 1 I/O group becoming a bottleneck.

HyperSwap volume and consistency group

HyperSwap volumes create copies on separate sites for systems that are configured with the HyperSwap topology. Data that is written to a HyperSwap volume is automatically sent to both copies so that either site can provide access to the volume if the other site becomes unavailable. HyperSwap volumes are supported on SAN Volume Controller clusters that contain more than one I/O group.

HyperSwap is a system topology that enables disaster recovery and high availability between I/O groups at different locations. Before you configure HyperSwap volumes, the system topology needs to be configured for HyperSwap and sites must be defined. 

In addition, the management GUI creates an active-active relationship and change volumes automatically. Active-active relationships manage the synchronous replication of data between HyperSwap volume copies at the two sites. You can specify a consistency group that contains multiple active-active relationships to simplify management of replication and provide consistency across multiple volumes. A consistency group is commonly used when an application spans multiple volumes. Change volumes maintain a consistent copy of the data during resynchronization, so that an older copy can be used for disaster recovery if a failure occurs on the up-to-date copy before resynchronization completes.
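The relationship between a HyperSwap volume, its two copies, the change volumes, and a consistency group can be pictured with a small data model. The following plain-Python sketch is illustrative only; the class and attribute names are hypothetical and do not correspond to SVC CLI objects.

    # Illustrative data model only; names do not correspond to SVC CLI objects.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class HyperSwapVolume:
        name: str
        site1_copy: str            # copy at site 1
        site2_copy: str            # synchronously replicated copy at site 2
        site1_change_volume: str   # keeps a consistent image during resynchronization
        site2_change_volume: str
        # The active-active relationship replicates writes synchronously between the copies.

    @dataclass
    class ConsistencyGroup:
        """Groups the active-active relationships of all volumes used by one application,
        so that after a rolling disaster the older image is consistent across volumes."""
        name: str
        volumes: List[HyperSwapVolume] = field(default_factory=list)

        def add(self, volume: HyperSwapVolume) -> None:
            self.volumes.append(volume)

    app_cg = ConsistencyGroup("app1-cg")
    app_cg.add(HyperSwapVolume("vdisk0", "vdisk0-s1", "vdisk0-s2", "cv0-s1", "cv0-s2"))
    app_cg.add(HyperSwapVolume("vdisk1", "vdisk1-s1", "vdisk1-s2", "cv1-s1", "cv1-s2"))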

Quorum

A SAN Volume Controller cluster quorum disk is a managed disk (MDisk) or managed drive that contains a reserved area that is used exclusively for cluster management. The cluster maintains one active quorum and two quorum candidates (or backup quorums). A cluster uses the quorum disk for two purposes:
  • To break a tie when a SAN fault occurs and exactly half of the nodes that were previously members of the cluster are present. 
  • To hold a copy of important cluster configuration data. Just over 256 MB is reserved for this purpose on each quorum disk. 
 A HyperSwap SAN Volume Controller cluster typically maintains the active quorum disk at a third site to ensure cluster availability is not impacted by a failure of either primary site. 

To use a quorum disk as the quorum device, this third site must use Fibre Channel connectivity together with an external storage system. Sometimes, Fibre Channel connectivity is not possible. Starting with Spectrum Virtualize software version 7.6, it is possible to use an IP-based quorum application as the quorum device for the third site; no Fibre Channel connectivity is required. The IP quorum application is a Java application that runs on hosts at the third site. Even with IP quorum applications at the third site, quorum disks at site one and site two are required, because they are used to store metadata. 
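The tie-break role of the active quorum device can be illustrated with the following simplified sketch (plain Python). It is not the actual SVC voting algorithm; the node counts are hypothetical, and only the rule described above is modeled: a clear majority continues, an exact half continues only if it wins the race to the active quorum device, and a minority stops serving I/O.

    # Simplified illustration of quorum tie-breaking; not the actual SVC voting algorithm.
    def surviving_partition(total_nodes, visible_nodes, holds_active_quorum):
        """Decide whether a partition of the cluster may continue after a split."""
        if visible_nodes > total_nodes / 2:
            return True                      # clear majority: no tie-break needed
        if visible_nodes == total_nodes / 2:
            # Exactly half of the nodes remain: the partition that reaches the
            # active quorum device (at the third site) wins the tie-break.
            return holds_active_quorum
        return False                         # minority partitions stop serving I/O

    # A 4-node HyperSwap system split down the middle (2 nodes per site):
    print(surviving_partition(4, 2, holds_active_quorum=True))   # True  -> continues
    print(surviving_partition(4, 2, holds_active_quorum=False))  # False -> stops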

Site

A site corresponds to a physical location that houses the physical objects of the system.

A site is also referred to as a failure domain.

In a SAN Volume Controller HyperSwap configuration, the term site is used to identify components of the SAN Volume Controller that are contained within a boundary so that any failure that occurs (such as a power failure, fire, or flood) is contained within that boundary. The failure therefore cannot propagate or affect components that are outside of that boundary. The components that make up a SAN Volume Controller HyperSwap configuration must span three independent sites. Two sites contain SAN Volume Controller I/O Groups and if you are virtualizing external storage, the storage controllers that contain customer data. The third site contains a storage controller where the active quorum disk is located.

Configuration Requirements
  • Directly connect each node to two or more SAN fabrics at the primary and secondary sites (2 to 8 fabrics are supported). Sites are defined as independent failure domains.
  • Use a third site to house a quorum disk or IP quorum application. Quorum disks cannot be located on iSCSI-attached storage systems; therefore, iSCSI storage cannot be configured at the third site.
  • If a storage system is used at the third site, it must support extended quorum disks. More information is available in the interoperability matrixes that are available at the following website:
  • Place independent storage systems at the primary and secondary sites, and use active-active relationships to mirror the host data between the two sites. 
  • Connections can vary based on fibre type and small form-factor pluggable (SFP) transceiver (longwave and shortwave).
  • Nodes that have connections to switches that are longer than 100 meters (109 yards) must use longwave Fibre Channel connections. A longwave small form-factor pluggable (SFP) transceiver can be purchased as an optional component, and must be one of the longwave SFP transceivers that are listed at the following website:
  • Avoid using inter-switch links (ISLs) in paths between nodes and external storage systems. If this configuration is unavoidable, do not oversubscribe the ISLs because of substantial Fibre Channel traffic across the ISLs. For most configurations, trunking is required. Because ISL problems are difficult to diagnose, switch-port error statistics must be collected and regularly monitored to detect failures.
  • Using a single switch at the third site can lead to the creation of a single fabric rather than two independent and redundant fabrics. A single fabric is an unsupported configuration. 
  • Ethernet port 1 on every node must be connected to the same subnet or subnets. Ethernet port 2 (if used) of every node must be connected to the same subnet (this might be a different subnet from port 1). The same principle applies to other Ethernet ports.
  • Some service actions require physical access to all nodes in a system. If nodes in a HyperSwap system are separated by more than 100 meters, service actions might require multiple service personnel. Contact your service representative to inquire about multiple site support.
  • Use consistency groups to manage the volumes that belong to an application. This structure ensures that when a rolling disaster occurs, the out-of-date image is consistent and therefore usable for that application.
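A few of the requirements above (an I/O group at each of the two data sites, both nodes of an I/O group at the same site, and the active quorum device at an independent third site) lend themselves to a simple sanity check. The following plain-Python sketch is illustrative only; the layout format and identifiers are hypothetical, and the real validation is performed by the SAN Volume Controller itself.

    # Illustrative sanity check only; identifiers and layout format are hypothetical.
    def validate_hyperswap_layout(io_groups, quorum_site):
        """io_groups: list of dicts like {"nodes": ["nodeA", "nodeB"], "sites": [1, 1]}"""
        errors = []
        data_sites = set()
        for idx, grp in enumerate(io_groups):
            if len(set(grp["sites"])) != 1:
                errors.append(f"I/O group {idx}: both nodes must be at the same site")
            data_sites.add(grp["sites"][0])
        if data_sites != {1, 2}:
            errors.append("at least one I/O group is required at each of site 1 and site 2")
        if quorum_site in data_sites or quorum_site != 3:
            errors.append("the active quorum device must be at an independent third site")
        return errors

    layout = [{"nodes": ["node1", "node2"], "sites": [1, 1]},
              {"nodes": ["node3", "node4"], "sites": [2, 2]}]
    print(validate_hyperswap_layout(layout, quorum_site=3))   # [] -> layout is valid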
Table 1. Metro Cluster Software Components
  • IBM SVC/Storwize HyperSwap: 7.7.0 or newer
  • VMware vSphere: 6.0 or newer

VMware environment
  • ESXi 6.0 or newer. For the most up-to-date information on the recommended version of ESXi, refer to the SAN Volume Controller release notes.
  • ESXi hosts should use the Native Multipathing Plug-in (NMP) with the Round Robin Path Selection Policy (PSP) for SAN Volume Controller. This is the default for SAN Volume Controller, so no action is required. 
  • For management and vMotion traffic, the ESXi hosts at both data centers must have a private network on the same IP subnet and broadcast domain. 
Preferably, management and vMotion traffic should be on separate networks.
  • The VMware vCenter must be accessible from all ESXi hosts at both data centers. 
  • The virtual machine IP network must be accessible to the ESXi hosts at both data centers. This ensures that any VMware HA event that restarts a virtual machine on an ESXi host at either data center is transparent to the virtual machine's clients. 
  • All datastores used by the ESXi hosts and virtual machines must be accessible from ESXi hosts at both data centers. 
  • The datastores used by the ESXi hosts and virtual machines must be provisioned on HyperSwap volumes.
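The datastore accessibility requirement above can be verified programmatically. The following sketch uses pyVmomi (an assumption; the vCenter address, credentials, and cluster name are placeholders) to check that every datastore in the stretched cluster is reported as accessible by every ESXi host in that cluster.

    # Minimal sketch, assuming pyVmomi is installed; connection details are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def check_datastore_access(content, cluster_name):
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == cluster_name)
        for ds in cluster.datastore:
            # Hosts that currently report this datastore as accessible
            accessible = {m.key.name for m in ds.host if m.mountInfo.accessible}
            for host in cluster.host:
                status = "OK" if host.name in accessible else "NOT ACCESSIBLE"
                print(f"{ds.name} on {host.name}: {status}")

    if __name__ == "__main__":
        ctx = ssl._create_unverified_context()          # lab use only; skips certificate checks
        si = SmartConnect(host="vcenter.example.com",   # placeholder vCenter and credentials
                          user="administrator@vsphere.local",
                          pwd="password", sslContext=ctx)
        try:
            check_datastore_access(si.RetrieveContent(), "StretchedCluster")  # placeholder name
        finally:
            Disconnect(si)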

Topologies



Figure 1. HyperSwap configuration

The HyperSwap function provides highly available volumes accessible through two sites up to 300 km apart. A fully independent copy of the data is maintained at each site. When data is written by hosts at either site, both copies are synchronously updated before the write operation is completed. The HyperSwap function automatically optimizes itself to minimize the data transmitted between sites and to minimize host read and write latency.
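The synchronous write behavior described above, where a write completes to the host only after both copies are updated, can be sketched as follows. This is plain Python for illustration only; the Copy class and the in-memory write are hypothetical stand-ins for the real replication protocol.

    # Simplified illustration of a synchronous HyperSwap write; not the real protocol.
    from concurrent.futures import ThreadPoolExecutor

    class Copy:
        """One copy of a HyperSwap volume (hypothetical in-memory stand-in)."""
        def __init__(self, site):
            self.site = site
            self.blocks = {}

        def write(self, lba, data):
            self.blocks[lba] = data
            return True                      # acknowledgement from this site

    def hyperswap_write(copies, lba, data):
        # The write is completed to the host only after BOTH copies acknowledge it.
        with ThreadPoolExecutor(max_workers=len(copies)) as pool:
            acks = list(pool.map(lambda c: c.write(lba, data), copies))
        return all(acks)

    copies = [Copy("site1"), Copy("site2")]
    print(hyperswap_write(copies, lba=0, data=b"payload"))   # True once both sites ack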

If the nodes or storage at either site go offline, leaving an online and accessible up-to-date copy, the HyperSwap function will automatically fail over access to the online copy. The HyperSwap function also automatically resynchronizes the two copies when possible. 

The HyperSwap function in the SAN Volume Controller software works with the standard multipathing drivers that are available on a wide variety of host types, with no additional host support required to access the highly available volume. Where multipathing drivers support ALUA, the storage system tells the multipathing driver which nodes are closest to the host and should be used to minimize I/O latency. You only need to tell the storage system which site a host is connected to, and it configures host pathing optimally. 
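The ALUA behavior described above, where the system reports which paths are optimized for a host and the multipathing driver prefers those paths, can be sketched as follows. This is plain Python for illustration; the path names mimic ESXi runtime names but are hypothetical, and the round-robin selection is a simplification of the NMP/PSP behavior.

    # Simplified ALUA-aware round-robin path selection; path names are hypothetical.
    from itertools import cycle

    class Path:
        def __init__(self, name, optimized, alive=True):
            self.name = name
            self.optimized = optimized   # active-optimized (local I/O group) vs non-optimized
            self.alive = alive

    def select_paths(paths):
        """Prefer live active-optimized paths; fall back to non-optimized paths."""
        optimized = [p for p in paths if p.alive and p.optimized]
        fallback = [p for p in paths if p.alive and not p.optimized]
        return cycle(optimized or fallback)  # round-robin over the chosen set

    paths = [Path("vmhba1:C0:T0:L0", optimized=True),
             Path("vmhba2:C0:T1:L0", optimized=True),
             Path("vmhba1:C0:T2:L0", optimized=False),
             Path("vmhba2:C0:T3:L0", optimized=False)]

    rr = select_paths(paths)
    print(next(rr).name, next(rr).name)      # alternates across the optimized paths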

In Figure 1, if the I/O group that owns the primary vdisk (V1_P) is lost, paths from the second I/O group take over and the SAN Volume Controller internally uses the secondary vdisk (V1_S) to process data.
  • If you lose access to I/O Group 1 from the host, the host multipathing automatically accesses the data via I/O Group 2
  • If you lose only the primary copy of the data, the HyperSwap function forwards requests to I/O Group 2 to service the I/O
  • If you lose I/O Group 1 entirely, the host multipathing automatically accesses the other copy of the data on I/O Group 2

User scenarios in a HyperSwap configuration 

For each scenario, the SVC (HyperSwap) behavior and the VMware vSphere behavior are listed.

Scenario 1: Using VMware vMotion or VMware Distributed Resource Scheduler (DRS) to migrate virtual machines between Site 1 and Site 2
  • SVC behavior: According to the I/O throughput, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency.
  • VMware vSphere behavior: I/O continues with the SVC site at Site 1.

Scenario 2: Failure of all ESXi hosts in Site 1 (power off)
  • SVC behavior: According to the I/O throughput, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency.
  • VMware vSphere behavior: VMware HA automatically restarts the virtual machines on the available ESXi hosts in Site 2. There is no downtime if Fault Tolerance is configured on the virtual machines.

Scenario 3: Host partial path failure (some paths are still alive)
  • SVC behavior: No impact.
  • VMware vSphere behavior: No impact on virtual machines. ESXi I/O is redirected to any available active path via the PSP (ALUA).

Scenario 4: Failure of all preferred paths on the host (from the SVC site at Site 1); only nonpreferred paths (from the SVC site at Site 2) are alive
  • SVC behavior: According to the I/O throughput, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency.
  • VMware vSphere behavior: ESXi I/O is redirected to the nonpreferred paths via the PSP (ALUA). No impact on virtual machines.

Scenario 5: Failure of all paths on the host (APD); no paths are alive
  • SVC behavior: According to the I/O throughput, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency.
  • VMware vSphere behavior: HA/DRS balances load on the surviving hosts in the cluster. There are two options to recover the virtual machines:
      • ESXi hosts must be shut down manually for VMware High Availability to restart the virtual machines on the other hosts
      • Enable the VMCP capability under the HA settings to handle the datastore APD condition and restart the virtual machines on the other hosts (a configuration sketch follows this list)

Scenario 6: SVC site at Site 1 fails
  • SVC behavior: HyperSwap failover.
      • Secondary HyperSwap volumes/consistency groups on Site 2 become Primary in the HyperSwap relationships
      • Host I/O is redirected to Site 2
      • If the SVC site at Site 1 has recovered from the failure and has been receiving a significant amount of writes for a while, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency
  • VMware vSphere behavior:
      • Active paths to the SVC site at Site 1 are reported unavailable
      • Active paths to the SVC site at Site 2 become preferred
      • No disruption to virtual machines and/or ESXi I/O

Scenario 7: Site 1 failure (both ESXi hosts and SVC)
  • SVC behavior: HyperSwap failover.
      • Secondary HyperSwap volumes/consistency groups on Site 2 become Primary in the HyperSwap relationships
      • If Site 1 has recovered from the failure and has been receiving a significant amount of writes for a while, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency
  • VMware vSphere behavior:
      • VMware High Availability restarts the failed virtual machines on the available ESXi hosts at Site 2
      • There is no downtime if Fault Tolerance is configured on the failed virtual machines

Scenario 8: The SVC site that owns the Primary volume loses connectivity with the quorum and with the SVC site that owns the Secondary volume
  • SVC behavior: HyperSwap failover.
      • Secondary HyperSwap volumes/consistency groups on Site 2 become Primary in their respective HyperSwap relationships
      • Host I/O is redirected to the SVC site at Site 2
  • VMware vSphere behavior:
      • Active paths to HyperSwap volumes on Site 1 are reported unavailable
      • Active paths to HyperSwap volumes on Site 2 become preferred
      • No disruption to virtual machines and/or ESXi I/O

Scenario 9: Site 2 failure (both ESXi hosts and SVC)
  • SVC behavior: HyperSwap failover.
      • Primary HyperSwap volumes/consistency groups on Site 1 are not affected
      • If Site 2 has recovered from the failure and has been receiving a significant amount of writes for a while, the HyperSwap function automatically switches the direction of the relationship to maximize I/O efficiency
  • VMware vSphere behavior: No disruption to virtual machines running on Site 1.

Scenario 10: HyperSwap link failure
  • SVC behavior:
      • Synchronization between Primary and Secondary HyperSwap volumes/consistency groups is broken
      • Secondary HyperSwap volumes/consistency groups stop serving host I/Os
      • Primary HyperSwap volumes/consistency groups continue serving I/Os
  • VMware vSphere behavior:
      • No disruption to virtual machines and/or ESXi I/O
      • Paths to Primary volumes/consistency groups remain Active/Preferred
      • Paths to Secondary volumes/consistency groups become unavailable

Scenario 11: Quorum failure
  • SVC behavior: Mirroring between Primary and Secondary volumes continues, and both Primary and Secondary keep serving host I/Os.
  • VMware vSphere behavior:
      • No disruption to virtual machines
      • An additional failure at this point will not trigger an automatic failover and can result in loss of access
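Scenario 5 above refers to enabling the VMCP capability under the HA settings. As an illustration, the following sketch shows one way this could be toggled programmatically with pyVmomi; the use of pyVmomi, the vCenter address, credentials, and cluster name are assumptions, and the setting is normally changed in the vSphere Client HA settings.

    # Minimal sketch, assuming pyVmomi is installed; connection details are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    def enable_vmcp(content, cluster_name):
        """Turn on VM Component Protection (VMCP) for an HA-enabled cluster."""
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == cluster_name)

        spec = vim.cluster.ConfigSpecEx()
        spec.dasConfig = vim.cluster.DasConfigInfo()
        spec.dasConfig.vmComponentProtecting = "enabled"    # let VMCP react to APD/PDL events
        return cluster.ReconfigureComputeResource_Task(spec, modify=True)

    if __name__ == "__main__":
        ctx = ssl._create_unverified_context()              # lab use only; skips certificate checks
        si = SmartConnect(host="vcenter.example.com",       # placeholder vCenter and credentials
                          user="administrator@vsphere.local",
                          pwd="password", sslContext=ctx)
        try:
            enable_vmcp(si.RetrieveContent(), "StretchedCluster")  # placeholder cluster name
        finally:
            Disconnect(si)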

Additional references
For information about the use of IBM SVC HyperSwap with VMware Site Recovery Manager, refer to:
For more information, see VMware vSphere Metro Storage Cluster Recommended Practices.