IBM FlashSystem A9000 and A9000R are all-flash block storage appliances that deliver consistently high performance and simplified management, built on a grid-scale architecture with comprehensive data reduction technology, for cloud and on-premises storage deployments.
Starting with software version 12.1, IBM FlashSystem A9000/R storage systems support HyperSwap. This functionality enables a highly available, all-flash storage service by supporting a cross-system, cross-datacenter, active-active configuration without extra licensing or special hardware.
For VMware environments, HyperSwap supports multiple vSphere stretched storage cluster solutions, including:
- Highly-available active-active vSphere datastores
- Workload mobility
- Cross-site automated load balancing
- Enhanced downtime avoidance
- Disaster avoidance
Some of the listed solutions involve VMware Site Recovery Manager (SRM). This document focuses on solutions that rely on vMSC configurations and do not require SRM.
VMware vSphere Metro Storage Cluster (vMSC) is a specific configuration within the VMware Hardware Compatibility List (HCL). These configurations are commonly referred to as stretched storage clusters or metro storage clusters, and they are implemented in environments where disaster avoidance and downtime avoidance are key requirements.
Configuration
This section provides general information about the IBM HyperSwap solution components, the concept of a failure domain, the HyperSwap volume, the Quorum Witness, and configuration requirements.
HyperSwap solution components
A minimal HyperSwap solution consists of:
- Two IBM FlashSystem A9000/R storage systems, interconnected for synchronous replication via Fibre Channel
- HyperSwap-protected hosts, each connected to both systems via iSCSI or Fibre Channel
- Quorum Witness software, installed on a VM or a physical host, with TCP/IP connectivity to both systems
The paired systems maintain one or more HyperSwap relationships between them. Each relationship supports one HyperSwap volume or HyperSwap consistency group.
Larger configurations are possible, since every IBM FlashSystem A9000/R system can have HyperSwap relationships with multiple other systems.
HyperSwap volumes and consistency groups
A HyperSwap volume is implemented as a pair of volumes with identical SCSI attributes, one on each system. These volumes are kept synchronized at all times. From the host perspective, the two volumes appear as a single volume, and I/O, both reads and writes, can be served from either system, depending on the path used. In other words, a HyperSwap solution gives a host active-active access to the same data on two systems.
- To be identical, the pair of volumes that constitute a HyperSwap volume have the same SCSI identity and I/O-related attributes: size, locks, and reservations. Each storage system maps its volume to the host separately, but the host perceives the two as a single volume. This makes transitions, such as automatic failover and manual failback, transparent both to the hosts and to the applications running on them.
- To be synchronized, the peer systems are interconnected for synchronous replication, and a HyperSwap relationship is established between the peer volumes. One of these volumes is initially designated as Primary, and the other is designated as Secondary. Unlike conventional mirroring, replication between the Primary and Secondary volumes is bidirectional, which allows read and write I/O to be served on either volume. The purpose of the Primary/Secondary designation is to optimize latency: the Primary volume should be co-located with the hosts that generate most of the I/O.
Multiple HyperSwap volumes can exist between any HyperSwap-paired systems. Moreover, HyperSwap volumes can be grouped into A9000/R consistency groups. In that case, HyperSwap actions are applied to the entire consistency group, that is, to all the HyperSwap volumes in it.
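The pairing described above can be summarized with a small conceptual sketch. The Python below is illustrative only; the class names, attributes, and example values are not part of the A9000/R object model or any IBM API. It simply captures the idea that two peer volumes share one SCSI identity and that both can serve I/O once synchronized.

```python
# A minimal, conceptual model of a HyperSwap volume: two peer volumes with an
# identical SCSI identity, one Primary and one Secondary. Names and values are
# illustrative only and do not reflect the A9000/R implementation.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PeerVolume:
    system: str         # storage system hosting this peer, e.g. "A9000-SiteA"
    role: str           # "Primary" or "Secondary"
    synchronized: bool  # True once the initial synchronization has completed


@dataclass
class HyperSwapVolume:
    scsi_identity: str                     # same identity presented by both peers
    size_gb: int                           # identical size on both systems
    peers: Tuple[PeerVolume, PeerVolume]

    def local_read_systems(self):
        """Systems that serve reads locally: the Primary always does; the
        Secondary does so only after synchronization has completed."""
        return [p.system for p in self.peers
                if p.role == "Primary" or p.synchronized]


vol = HyperSwapVolume(
    scsi_identity="example-scsi-identifier",   # illustrative value
    size_gb=1024,
    peers=(PeerVolume("A9000-SiteA", "Primary", True),
           PeerVolume("A9000-SiteB", "Secondary", True)),
)
print(vol.local_read_systems())   # ['A9000-SiteA', 'A9000-SiteB']
```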
Quorum Witness
Every clustering solution requires a quorum witness component that cluster members can consult at any time to avoid split-brain situations. In the HyperSwap solution, this function is performed by the IBM Spectrum Accelerate Family HyperSwap Quorum Witness, a software application that allows FlashSystem A9000/R systems to determine which system should own the Primary volume.
Connectivity between a Quorum Witness and the storage systems is established via TCP/IP.
When the Quorum Witness is down for any reason, there is no impact on HyperSwap active-active data access, and various failure scenarios can still be accommodated without disruption. However, while the Quorum Witness is down, automatic failover cannot be applied, because the A9000/R systems have no other way to ensure that a failover will not result in a split-brain situation. Therefore, when the Quorum Witness is down, the risk of downtime is elevated. To minimize Quorum Witness disruption, it can be deployed as a highly available VM on a VMware vSphere cluster, using VMware High Availability.
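The tie-breaker role described above can be illustrated with the following sketch. It is a conceptual model, not the actual A9000/R failover algorithm; the function name and parameters are invented for illustration only.

```python
# Conceptual sketch only: how a quorum witness prevents split-brain.
# This is not the A9000/R failover algorithm; names are illustrative.
def may_take_over(peer_reachable: bool, witness_reachable: bool) -> bool:
    """Decide whether the system owning the Secondary volume may assume
    the Primary role after an apparent peer failure."""
    if peer_reachable:
        return False   # peer is alive: no failover needed
    if not witness_reachable:
        return False   # no arbiter: refusing avoids split-brain, but
                       # automatic failover is impossible (elevated risk)
    return True        # witness confirms the peer is down or isolated


# Peer unreachable, witness reachable: automatic failover is allowed.
print(may_take_over(peer_reachable=False, witness_reachable=True))   # True
# Witness also down: no automatic failover.
print(may_take_over(peer_reachable=False, witness_reachable=False))  # False
```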
Failure domain
A failure domain encompasses all the elements potentially affected by a single failure. For example, an earthquake or power-grid failure can potentially take down a whole datacenter. The datacenter is therefore a failure domain with regard to earthquake and power-grid failures. To protect against such failures, the infrastructure can be divided between two geographically separate datacenters that do not use the same power grid. Each datacenter is then a failure domain, and if one of them fails, the other can take over. Failure domains can also be defined within a single datacenter, depending on the kind of failure they protect against. For example, to protect against overheating, the datacenter may have multiple cooling systems; the area protected by each cooling system is a failure domain from the cooling perspective.
Therefore, for every two systems that are HyperSwap-paired, the best practice is to spread the two systems and the Quorum Witness across three failure domains, to prevent a situation where more than one of these elements is down at the same time. The highest availability is obtained when these three failure domains are three geographically separated sites, with separate power and network resources.
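As a minimal illustration of this best practice, the sketch below checks that the two paired systems and the Quorum Witness are placed in three distinct failure domains. The component and domain names are hypothetical.

```python
# Conceptual sketch only: verify that the two HyperSwap-paired systems and the
# Quorum Witness do not share a failure domain. Names are hypothetical.
placement = {
    "A9000-SiteA": "failure-domain-1",
    "A9000-SiteB": "failure-domain-2",
    "QuorumWitness": "failure-domain-3",
}


def follows_best_practice(placement: dict) -> bool:
    """Best practice: no two HyperSwap solution components share a failure domain."""
    return len(set(placement.values())) == len(placement)


print(follows_best_practice(placement))  # True
```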
HyperSwap configuration requirements
The following configuration is required for an A9000/R vMSC solution:
- A9000/R storage arrays:
- Location: It is highly recommended that the HyperSwap-paired systems be installed in separate failure domains
- Software version: 12.1.0 or later
- Host connectivity: Fibre Channel or iSCSI
- Mirror connectivity: Fibre Channel connectivity is required for HyperSwap volumes.
- Quorum Witness
- Location: It is highly recommended that the Quorum Witness be installed in a separate failure domain
- Installation: The Quorum Witness can be installed on a bare-metal host or as a VMware VM. If a VMware VM is used, it is highly recommended to protect that VM with VMware Fault Tolerance.
- A Quorum Witness supporting a minimum of two systems requires the following configuration:
- 2 CPU cores
- 4 GB RAM
- 40 GB storage
For comprehensive and up-to-date information on IBM Quorum Witness compatibility and requirements, refer to the latest Quorum Witness release notes.
- Connectivity between the A9000/R storage arrays and the Quorum Witness
Note: All the following requirements pertain to the TCP/IP connection, end to end. For example, if the Quorum Witness is running as a VMware VM, the requirements also apply to the connectivity of the VM inside the ESXi host.
- Required maximum packet loss: 0.1%.
If the packet loss rate is higher, the HyperSwap feature might not function as required, and throughput between the Quorum Node and the Quorum Witness will increase due to packet retransmission.
- Recommended connectivity availability: 99.999% (five nines).
This requirement stems from the fact that the reliability of a HyperSwap solution can only be as good as the reliability of its weakest infrastructure element. Because FlashSystem A9000/R availability is five nines, the other components in the solution must also be at least five nines.
- All packets in the Quorum Witness VM network must be tagged as "assured forwarding (AF)".
For example, for the Quorum Witness running as a VMware VM, refer to the "Mark Traffic on a Distributed Port or Uplink Port" section of the ESXi and vCenter Server 6.0 documentation.
- Connectivity must be established between all grid controller management ports and the Quorum Witness.
- Network bandwidth: Each system that is connected to the Quorum Witness consumes up to 8 Mbps of bandwidth. For example, if the Quorum Witness communicates with five A9000/R systems, the bandwidth that must be dedicated to it is as follows (see the calculation sketch after this requirements list):
Example: 5 systems x 8 Mbps = 40 Mbps
- Required maximum latency between the Quorum Witness (QW) and the Quorum Node (QN) systems: 0.75 seconds.
If the latency is higher, the HyperSwap feature might not function as required, and throughput between the Quorum Node and the Quorum Witness will increase due to packet retransmission.
- VMware environment
- ESXi 6.0
For the most up-to-date information on the recommended version of ESXi, refer to the FlashSystem A9000 or A9000R Release Notes.
- ESXi hosts should use the Native Multipathing Plug-in (NMP) with the Round Robin Path Selection Policy (PSP) for A9000/R volumes. Because this is the default for A9000/R, no action is required.
- For management and vMotion traffic, the ESXi hosts at both data centers must have a private network on the same IP subnet and broadcast domain.
Preferably, management and vMotion traffic should be on separate networks.
- The VMware vCenter must be accessible from all ESXi hosts at both data centers.
- The virtual machine IP network must be accessible from the ESXi hosts at both data centers. This ensures that any VMware HA event that restarts a virtual machine on another ESXi host remains transparent.
- All datastores used by the ESXi hosts and virtual machines must be accessible from ESXi hosts at both data centers.
- The datastores used by the ESXi hosts and virtual machines must be provisioned on HyperSwap volumes.
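The short sketch below pulls together the numeric requirements from the list above: the per-system Quorum Witness bandwidth, the packet-loss limit, and the latency limit. The thresholds come from this section; the helper functions themselves are illustrative and are not part of any IBM tool.

```python
# Conceptual sketch only: checking the Quorum Witness link requirements
# described in this section. Thresholds come from the text above; the helper
# functions are illustrative and not part of any IBM tool.
QW_BANDWIDTH_PER_SYSTEM_MBPS = 8   # up to 8 Mbps per connected system
MAX_PACKET_LOSS_PCT = 0.1          # required maximum packet loss (percent)
MAX_LATENCY_SECONDS = 0.75         # required maximum QW-to-QN latency


def required_qw_bandwidth_mbps(num_systems: int) -> int:
    """Bandwidth to dedicate to the Quorum Witness network."""
    return num_systems * QW_BANDWIDTH_PER_SYSTEM_MBPS


def link_meets_requirements(packet_loss_pct: float, latency_s: float) -> bool:
    """True if the measured packet loss and latency are within the limits."""
    return packet_loss_pct <= MAX_PACKET_LOSS_PCT and latency_s <= MAX_LATENCY_SECONDS


print(required_qw_bandwidth_mbps(5))        # 40 (5 systems x 8 Mbps)
print(link_meets_requirements(0.05, 0.2))   # True
print(link_meets_requirements(0.5, 0.2))    # False: packet loss above 0.1%
```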
Topologies
Volume-level perspective
The following figure shows a typical configuration of hosts, storage systems, and a Quorum Witness.
Preferably, the host and the storage system initially designated for the Primary volume should be located at the same site, and the Quorum Witness should be deployed at a separate, third site.
Host-to-storage-system paths, hereafter referred to as port groups, are optimized using Asymmetric Logical Unit Access (ALUA) support from the multipath driver. By assigning proper ALUA states (Preferred or Non-Preferred), the storage system informs the multipath driver which paths are preferred, to minimize I/O latency:
- Port groups to the system that currently owns the Primary volume are automatically marked as Active/Preferred.
- Port groups to the system that currently owns the Secondary volume are automatically marked as Active/Non-Preferred.
As a result, Active/Preferred port groups receive the bulk of I/O and SCSI commands. The remaining I/O and SCSI commands are directed to the Active/Non-Preferred port groups, and are then forwarded to the Primary volume.
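The following sketch shows, in simplified form, how a multipath policy can use these ALUA states: preferred paths carry the I/O, and non-preferred paths are used when no preferred path is available (as in the all-preferred-paths-failed scenario later in this document). It is a conceptual model only, not the ESXi NMP/PSP implementation; the path objects and the function are illustrative.

```python
# Conceptual sketch only: selecting paths by ALUA state. This is not the
# ESXi NMP/PSP code; path objects and names are illustrative.
paths = [
    {"target": "A9000-SiteA", "alua_state": "Active/Preferred"},
    {"target": "A9000-SiteA", "alua_state": "Active/Preferred"},
    {"target": "A9000-SiteB", "alua_state": "Active/Non-Preferred"},
    {"target": "A9000-SiteB", "alua_state": "Active/Non-Preferred"},
]


def select_paths(paths):
    """Prefer Active/Preferred paths; fall back to any remaining Active
    (Non-Preferred) paths when no preferred path is usable."""
    preferred = [p for p in paths if p["alua_state"] == "Active/Preferred"]
    usable = [p for p in paths if p["alua_state"].startswith("Active")]
    return preferred or usable


for p in select_paths(paths):
    print(p["target"], p["alua_state"])   # both preferred paths to Site A
```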
When the HyperSwap volume is activated, the Secondary volume is not synchronized, and read requests are redirected to the Primary volume until synchronization has completed. When the volumes are synchronized, the system that owns the Secondary volume serves read requests locally.
If the system that owns the Secondary volume is unable to perform I/O, the Secondary volume port group state changes to Unavailable. This usually happens due to a connectivity failure. The Primary volume remains active, so no automatic failover is needed. As soon as connectivity is restored, the volumes are resynchronized automatically.
If the system that owns the Primary volume is unable to perform I/O, the Primary volume port group state changes to Unavailable. As soon as the system that owns the Secondary volume receives the corresponding notification from the Quorum Witness, it performs a transparent failover, and the Secondary volume assumes the Primary role. When the system that was originally designated for the Primary volume is restored, recovery must be performed manually via the CLI or IBM Hyper-Scale Manager; recovery involves switching the roles of the peer systems and re-activating the HyperSwap relationship between them.
System-level perspective
Because every system can contain a mix of Primary and Secondary volumes and have HyperSwap relationships with multiple systems, it is useful to consider topologies at the system level, with multiple HyperSwap volumes. Here are a few examples.
In a conventional Disaster Recovery topology (that is, where one system is located at a site that the storage operations team designates for disaster recovery), one storage system owns the volumes designated as Primary, and the other system owns the volumes designated as Secondary:
In a star-shaped Disaster Recovery topology, where one of the sites is designated by the storage operations team as a Disaster Recovery site, a single storage system at the Disaster Recovery site is dedicated to simultaneously serving multiple other systems:
In a symmetrical system topology, both systems have volumes designated as Primary and Secondary, depending on the preferred location of the application:
Since every system can have HyperSwap relationships with up to ten other systems, and serve a mix of Primary and Secondary volumes, other topologies are possible as well.
Supported scenarios
The following scenarios are based on two HyperSwap solution configurations: uniform host connectivity and non-uniform host connectivity.
In a uniform configuration, each host can access both the Primary and the Secondary volumes. The uniform configuration is the best practice for protecting a host from data access problems:
Use cases for a uniform configuration:
# | Scenario | A9000/R behavior | VMware vSphere behavior |
1 | Using VMware vMotion or VMware Distributed Resource Scheduler (DRS) to migrate virtual machines between Data Center A and Data Center B | The user can optionally switch the roles of the volumes/consistency groups in order to change the role of the volumes/consistency groups in Data Center B to Primary. | I/O continues with the storage system in Data Center A. |
2 | Failure of all ESXi hosts in Data Center A (power off) | If the user issues the ha_switch_roles command, the I/O of the recovered virtual machines at Data Center B is served by the local storage system. | - VMware HA automatically restarts the virtual machines on the available ESXi hosts in Data Center B. - There is no downtime if Fault Tolerance is configured on the virtual machines. |
3 | Host partial path failure (some paths are still alive) | No impact | No impact on virtual machines. ESXi I/O is redirected to any available active path via PSP (ALUA). |
4 | Failure of all preferred paths on the host (local storage); only non-preferred paths are alive (remote storage) | The user can optionally switch the roles of the local and remote HyperSwap volumes or consistency groups in order to improve I/O latency. | ESXi I/O is redirected to non-preferred paths via PSP (ALUA). No impact on virtual machines. |
5 | Failure of all paths on the host (APD); no paths are alive | No impact | Two options to recover the virtual machines: - ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts. - Enable the VMCP capability under the HA settings to handle the datastore APD situation and restart virtual machines on the other hosts. |
6 | Data Center A A9000/R storage system fails | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their HyperSwap relations. - Host I/O is redirected to Data Center B. - When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration. | - Active paths to the Data Center A A9000/R are reported unavailable. - Active paths to the Data Center B A9000/R become preferred. - No disruption to virtual machines or ESXi I/O. - If the issue is not resolved, it is recommended to move the virtual machines to Data Center B. |
7 | Data Center A failure (both ESXi hosts and A9000/R) | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their HyperSwap relations. - When Data Center A is recovered, manual recovery is required to restore the original configuration. | - VMware High Availability restarts failed virtual machines on the available ESXi hosts at Data Center B. - There is no downtime if Fault Tolerance is configured on the failed virtual machines. |
8 | The storage system that owns the Primary volume loses connectivity with the Quorum Witness and with the storage system that owns the Secondary volume | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their respective HyperSwap relations. - Host I/O is redirected to the Data Center B storage system. - Volumes initially designated as Primary stop serving I/O. | - Active paths to HyperSwap volumes on Data Center A are reported unavailable. - Active paths to HyperSwap volumes on Data Center B become preferred. - No disruption to virtual machines or ESXi I/O. |
9 | Data Center B failure (both ESXi hosts and A9000/R) | HyperSwap failover: - Primary HyperSwap volumes/consistency groups on Data Center A are not affected. - When Data Center B is recovered, manual recovery is required to restore the original configuration. | No disruption to virtual machines running on Data Center A |
10 | Storage mirror link failure | - Synchronization between Primary and Secondary HyperSwap volumes/consistency groups is broken. - Secondary HyperSwap volumes/consistency groups stop serving host I/O. - Primary HyperSwap volumes/consistency groups continue serving I/O. | - No disruption to virtual machines or ESXi I/O. - Paths to Primary volumes/consistency groups remain Active/Preferred. - Paths to Secondary volumes/consistency groups become unavailable. |
11 | Quorum Witness server failure | Mirroring between Primary and Secondary volumes continues, and both Primary and Secondary keep serving host I/O. | - No disruption to virtual machines. - An additional failure at this point will not trigger automatic failover and can result in loss of access. |
In a non-uniform configuration, each host is connected to only one of the two storage systems, so storage high availability relies on the host cluster detecting the failure and failing the application over to a host that has access to active storage. A non-uniform configuration can be used when the host is part of a cluster, can fail over to another host in the cluster, and that other host is connected to the peer system. This configuration is less costly from a network perspective; however, it relies on host failover, which in most cases would not be necessary in a uniform configuration.
Use cases for a non-uniform configuration:
# | Scenario | A9000/R behavior | VMware vSphere behavior |
1 | Using VMware vMotion or VMware Distributed Resource Scheduler (DRS) to migrate virtual machines between Data Center A and Data Center B | The user can optionally switch the roles of the volumes/consistency groups in order to change the role of the volumes/consistency groups in Data Center B to Primary. | I/O continues with the storage system in Data Center A. |
2 | Failure of all ESXi hosts in Data Center A (power off) | If the user issues the ha_switch_roles command, the I/O of the recovered virtual machines at Data Center B is served by the local storage system. | - VMware HA automatically restarts the virtual machines on the available ESXi hosts in Data Center B. - There is no downtime if Fault Tolerance is configured on the virtual machines. |
3 | Host partial path failure (some paths are still alive) | No change | No impact on virtual machines. ESXi I/O is redirected to any available active path via PSP (ALUA). |
4 | Failure of all paths on the host (APD); no paths are alive | No change | Two options to recover the virtual machines: - ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts. - Enable the VMCP capability under the HA settings to handle the datastore APD situation and restart virtual machines on the other hosts. |
5 | Data Center A A9000/R storage system fails | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their HyperSwap relations. - When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration. | Two options to recover the virtual machines: - ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts. - Enable the VMCP capability under the HA settings to handle the APD situation and restart virtual machines on the other hosts. Active paths to the Data Center B A9000/R become preferred. |
6 | Data Center A failure (both ESXi hosts and A9000/R) | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their HyperSwap relations. - When Data Center A is recovered, manual recovery is required to restore the original configuration. | - VMware High Availability restarts failed virtual machines on the available ESXi hosts at Data Center B. - There is no downtime if Fault Tolerance is configured on the failed virtual machines. |
7 | The storage system that owns the Primary volume loses connectivity with the Quorum Witness and with the storage system that owns the Secondary volume | HyperSwap failover: - Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their respective HyperSwap relations. - When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration. | Two options to recover the virtual machines: - ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts. - Enable the VMCP capability under the HA settings to handle the APD situation and restart virtual machines on the other hosts. Active paths to the Data Center B A9000/R become preferred. |
8 | Data Center B failure (both ESXi hosts and A9000/R) | HyperSwap failover: - Primary HyperSwap volumes/consistency groups on Data Center A are not affected. - When Data Center B is recovered, manual recovery is required to restore the original configuration. | No disruption to virtual machines running on Data Center A |
9 | Storage mirror link failure | - Synchronization between Primary and Secondary HyperSwap volumes/consistency groups is broken. - Secondary HyperSwap volumes/consistency groups stop serving host I/O. - Primary HyperSwap volumes/consistency groups continue serving I/O. | No disruption to virtual machines or ESXi I/O |
10 | Quorum Witness server failure | Mirroring between Primary and Secondary volumes continues, and both Primary and Secondary keep serving host I/O. | - No disruption to virtual machines. - An additional failure at this point will not trigger automatic failover and can result in loss of access. |
Additional references