vSphere Stretch Cluster Solutions with IBM FlashSystem A9000 & A9000R HyperSwap

Article ID: 334473

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information on vSphere Stretch Cluster Solutions with IBM FlashSystem A9000 & A9000R HyperSwap.

Environment

VMware vSphere ESXi 6.0

Resolution

IBM FlashSystem A9000 and A9000R are all-flash block storage appliances that deliver consistently high performance and simplified management, built on a grid-scale architecture with comprehensive data reduction technology, for cloud and on-premises storage deployments.

Starting from software version 12.1, IBM FlashSystem A9000/R storage systems support HyperSwap. This functionality enables a highly available, all-flash storage service by supporting a cross-system, cross-datacenter, active-active configuration without extra licensing or special hardware.

For VMware environments, HyperSwap supports multiple vSphere stretched storage cluster solutions, including:
  • Highly-available active-active vSphere datastores
  • Workload mobility
  • Cross-site automated load balancing
  • Enhanced downtime avoidance
  • Disaster avoidance
Some of the solutions listed involve VMware Site Recovery Manager (SRM). The focus of this document is on solutions that rely on vMSC configurations and do not require SRM.

VMware vSphere Metro Storage Cluster (vMSC) is a specific configuration within the VMware Hardware Compatibility List (HCL). These configurations are commonly referred to as stretched storage clusters or metro storage clusters and are implemented in environments where disaster avoidance and downtime avoidance are key requirements.

Configuration
This section provides general information about the IBM HyperSwap solution components, the concept of a failure domain, the HyperSwap volume, the Quorum Witness, and configuration requirements.

HyperSwap solution components
A minimal HyperSwap solution consists of:
  • Two IBM FlashSystem A9000/R storage systems, interconnected for synchronous replication via Fibre Channel
  • HyperSwap-protected hosts, each connected to both systems via iSCSI or Fibre Channel
  • Quorum Witness software, installed on a VM or a physical host, with TCP/IP connectivity to both systems
The paired systems maintain one or more HyperSwap relationships between them. Each relationship facilitates one HyperSwap volume or HyperSwap consistency group.

Larger configurations are possible, since every IBM FlashSystem A9000/R system can have HyperSwap relationships with multiple other systems.

HyperSwap volumes and consistency groups
A HyperSwap volume is implemented as a pair of volumes with identical SCSI attributes, one on each system. These volumes are kept synchronized at all times. From the host perspective, the two volumes are a single volume, and I/O, both reads and writes, can be served from either system, depending on the path used. In other words, a HyperSwap solution allows a host to have active-active access to the same data on two systems.
  • To be identical, the pair of volumes that constitute a HyperSwap volume have the same SCSI identity and I/O-related attributes: size, locks, and reservations. Each storage system maps the paired volumes separately, but the host perceives them as a single volume. This makes transitions, such as automatic failover and manual failback, transparent both to the hosts and to the applications running on them.
  • To be synchronized, the peer systems are interconnected for synchronous replication, and a HyperSwap relationship is established between the peer volumes. One of these volumes is initially designated as Primary, and the other is designated as Secondary. As opposed to mirroring, the replication between the Primary and Secondary volumes is bidirectional, allowing read and write I/O to be served on either volume. The purpose of the Primary/Secondary designation is to optimize latency: the Primary volume should be co-located with the hosts that generate most of the I/O.
Multiple HyperSwap volumes can exist between any HyperSwap-paired systems. Moreover, HyperSwap volumes can be grouped into A9000/R consistency groups. In that case, HyperSwap actions are applied to the entire consistency group, that is, to all the HyperSwap volumes in it.
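To illustrate why the identical SCSI identity matters, the following minimal Python sketch groups discovered paths by the volume identifier they report, which is conceptually how a multipath layer collapses paths to the two peer volumes into a single device. The path records, system names, and the group_paths_by_identity helper are hypothetical and purely illustrative; they are not part of the A9000/R or ESXi software.

from collections import defaultdict

# Hypothetical path records: each path reports the SCSI (NAA) identifier
# returned by the volume it reaches. Because both peer volumes of a
# HyperSwap volume present the same identifier, their paths collapse
# into one logical device from the host's point of view.
paths = [
    {"path": "vmhba1:C0:T0:L1", "system": "A9000-SiteA", "naa_id": "naa.600000000000000000000000000000ab"},
    {"path": "vmhba1:C0:T1:L1", "system": "A9000-SiteA", "naa_id": "naa.600000000000000000000000000000ab"},
    {"path": "vmhba2:C0:T0:L1", "system": "A9000-SiteB", "naa_id": "naa.600000000000000000000000000000ab"},
    {"path": "vmhba2:C0:T1:L1", "system": "A9000-SiteB", "naa_id": "naa.600000000000000000000000000000ab"},
]

def group_paths_by_identity(paths):
    """Group paths by the SCSI identifier they report."""
    devices = defaultdict(list)
    for p in paths:
        devices[p["naa_id"]].append(p)
    return devices

for naa_id, members in group_paths_by_identity(paths).items():
    systems = sorted({p["system"] for p in members})
    # One logical device is seen, with paths leading to both HyperSwap peers.
    print(f"Device {naa_id}: {len(members)} paths via {', '.join(systems)}")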

Quorum Witness
Every clustering solution requires a Quorum Witness component that the cluster members can consult at any time to avoid split-brain situations. In the HyperSwap solution, this function is performed by the IBM Spectrum Accelerate Family HyperSwap Quorum Witness, a software application that allows FlashSystem A9000/R arrays to determine which array should own the Primary volume.

Connectivity between a Quorum Witness and the storage systems is established via TCP/IP.

When the Quorum Witness is down for any reason, there is no impact on HyperSwap active-active data access, and various failure scenarios can still be accommodated without disruption. However, while the Quorum Witness is down, automatic failover cannot be applied, because the A9000/R systems have no other way to ensure that a failover will not result in a split-brain situation. Therefore, when the Quorum Witness is down, the risk of downtime is elevated. To minimize Quorum Witness disruption, it can be deployed as a highly available VM on a VMware vSphere cluster, protected by VMware High Availability.
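The arbitration idea can be sketched in a few lines of Python. The sketch below is a simplified, hypothetical model of witness-based tie-breaking, not IBM's Quorum Witness implementation: when the mirror link is lost, each system asks the witness for permission to act as Primary, and the witness grants it to at most one system per relationship, which is what prevents a split brain. If the witness is unreachable, no grant can be issued, which is why automatic failover is suspended while the Quorum Witness is down.

import threading

class QuorumWitness:
    """Simplified tie-breaker: grants the Primary role for a HyperSwap
    relationship to at most one storage system at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner_by_relation = {}   # relation name -> winning system

    def request_primary(self, relation: str, system: str) -> bool:
        """Return True only for the first system to ask about a relation."""
        with self._lock:
            current = self._owner_by_relation.setdefault(relation, system)
            return current == system


# Both systems detect a mirror-link failure and race to the witness.
witness = QuorumWitness()
for system in ("A9000-SiteA", "A9000-SiteB"):
    granted = witness.request_primary("hs_vol_01", system)
    action = "keeps/assumes the Primary role" if granted else "stops serving I/O"
    print(f"{system}: {action}")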

Failure domain
A failure domain encompasses all the elements potentially affected by a single failure. For example, an earthquake or power-grid failure can potentially take down a whole datacenter. The datacenter is therefore a failure domain with regard to earthquakes and power-grid failures. To protect against such failures, the infrastructure can be divided between two geographically separate datacenters that do not share the same power grid. Each datacenter is considered a failure domain, and if one of them fails, the other can replace it. Failure domains can also be defined within a single datacenter, depending on the kind of failure they protect against. For example, to protect against overheating, the datacenter may have multiple cooling systems. The area protected by each cooling system is then a failure domain from the cooling perspective.

Therefore, for every two systems that are HyperSwap-paired, the best practice is to spread the two systems and the Quorum Witness across three failure domains, so that a single failure cannot take down more than one of these elements. The highest availability level is obtained when the three failure domains are three geographically separated sites with separate power and network resources.
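As a minimal illustration of this best practice, the following hypothetical Python check verifies that the two HyperSwap-paired systems and the Quorum Witness are placed in three distinct failure domains. The placement dictionary, its keys, and the function name are assumptions made purely for the example.

def placement_ok(placement: dict) -> bool:
    """Expect keys 'system_a', 'system_b', and 'quorum_witness', each mapped
    to a failure-domain name; all three must be different."""
    domains = {placement["system_a"], placement["system_b"], placement["quorum_witness"]}
    return len(domains) == 3

# Recommended: three geographically separated sites.
print(placement_ok({"system_a": "site-1", "system_b": "site-2", "quorum_witness": "site-3"}))  # True
# Not recommended: the Quorum Witness shares a failure domain with one system.
print(placement_ok({"system_a": "site-1", "system_b": "site-2", "quorum_witness": "site-1"}))  # False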

HyperSwap configuration requirements
The following configuration is required for an A9000/R vMSC solution:
  • A9000/R storage arrays:
    • Location: It is highly recommended that HyperSwap-paired systems be installed in separate failure domains
    • Software version: 12.1.0 or later
    • Host connectivity: Fibre Channel or iSCSI
    • Mirror connectivity: Fibre Channel connectivity is required for HyperSwap volumes

  • Quorum Witness
    • Location: It is highly recommended that the Quorum Witness be installed in a separate failure domain
    • Installation: The Quorum Witness can be installed on a bare-metal host or as a VMware VM. If a VMware VM is used, it is highly recommended to protect the Quorum Witness VM with VMware Fault Tolerance.
      • To support a minimum of two systems, the Quorum Witness requires the following configuration:
        • 2 CPU cores
        • 4GB RAM
        • 40GB storage
For comprehensive and up-to-date information on IBM Quorum Witness compatibility and requirements, refer to the latest Quorum Witness release notes.
  • Connectivity between the A9000/R storage arrays and the Quorum Witness
Note: All the following requirements pertain to the end-to-end TCP/IP connection. For example, if the Quorum Witness is running as a VMware VM, the requirements also apply to the connectivity of the VM inside the ESXi host.
    • The required maximum packet loss: 0.1%.

      If the packet loss rate is higher, the HyperSwap feature might not function as required. Throughput between the Quorum Node and the Quorum Witness will increase due to packet retransmission.

    • The recommended connectivity availability: 99.999% (five nines).

      This requirement stems from the fact that the reliability of a HyperSwap solution can only be as good as the reliability of its weakest infrastructure element. Since the availability of the FlashSystem A9000/R systems is five nines, the other components in the solution must be at least five nines as well.

    • All packets in the Quorum Witness VM network must be tagged as "assured forwarding (AF)".

      For example, for the Quorum Witness running as a VMware VM, refer to the "Mark Traffic on a Distributed Port or Uplink Port" section of the ESXi and vCenter Server 6.0 documentation.

    • Connectivity must be established between all grid controller management ports and the Quorum Witness.

    • Network bandwidth: Each system that is connected to the Quorum Witness consumes up to 8 Mbps of bandwidth. For example, the bandwidth that must be dedicated to the Quorum Witness if it communicates with five A9000/R systems is:
      Example: 5 systems x 8 Mbps = 40 Mbps

    • The required maximum latency between the Quorum Witness (QW) and Quorum Node (QN) systems: 0.75 seconds.

      If the latency is higher, the HyperSwap feature might not function as required. Throughput between the Quorum Node and the Quorum Witness will increase due to packet retransmission.
  • VMware environment
    • ESXi 6.0

      For the most up-to-date information on the recommended version of ESXi, refer to the FlashSystem A9000 or A9000R Release Notes.

    • ESXi hosts should use Native Multipathing (NMP) with the Round Robin path selection policy (PSP) for A9000/R devices. This is the default for A9000/R, so no action is required (see the verification sketch after this list).
    • For management and vMotion traffic, the ESXi hosts at both data centers must have a private network on the same IP subnet and broadcast domain.

      Preferably, management and vMotion traffic should be on separate networks.
    • The VMware vCenter must be accessible from all ESXi hosts at both data centers.
    • The virtual machine IP network must be accessible to the ESXi hosts at both data centers. This ensures that a VMware HA event is transparent to any virtual machine, regardless of which ESXi host it is restarted on.
    • All datastores used by the ESXi hosts and virtual machines must be accessible from ESXi hosts at both data centers.
    • The datastores used by the ESXi hosts and virtual machines must be provisioned on HyperSwap volumes.
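As a way to verify the multipathing requirement above, the sketch below uses pyVmomi (the Python SDK for the vSphere API) to list the path selection policy reported for each device on every ESXi host, so A9000/R LUNs can be confirmed to use Round Robin (VMW_PSP_RR). This is a minimal sketch: the vCenter address and credentials are placeholders, and filtering the output down to the A9000/R LUNs in your environment is left as an assumption.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: replace with your vCenter address and credentials.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        if host.config is None:       # skip disconnected hosts
            continue
        storage = host.config.storageDevice
        # Map each SCSI LUN key to its canonical name (naa.* identifier).
        names = {lun.key: lun.canonicalName for lun in storage.scsiLun}
        for mp_lun in storage.multipathInfo.lun:
            device = names.get(mp_lun.lun, mp_lun.id)
            psp = mp_lun.policy.policy  # e.g. "VMW_PSP_RR" for Round Robin
            print(f"{host.name}  {device}  PSP={psp}")
    view.Destroy()
finally:
    Disconnect(si)

Devices backed by A9000/R HyperSwap volumes should report VMW_PSP_RR; with the default claim rules in place, no change should be needed.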

Topologies
Volume level perspective

The following figure focuses on a typical configuration of hosts, storage systems and a Quorum Witness.

Preferably, the host and the storage system initially designated for the Primary volume should be located at the same site, and the Quorum Witness should be deployed at a separate, third site.


Host-to-storage-system paths, hereafter referred to as port groups, are optimized using Asymmetric Logical Unit Access (ALUA) support from the multipath driver. By assigning proper ALUA states (Preferred or Non-Preferred), the storage system informs the multipath driver which paths are preferred, to minimize I/O latency:
  • Port groups to the system that currently owns the Primary volume are automatically marked as Active/Preferred
  • Port groups to the system that currently owns the Secondary volume are automatically marked as Active/Non-Preferred.
As a result, Active/Preferred port groups receive the bulk of I/O and SCSI commands. The remaining I/O and SCSI commands are directed to the Active/Non-Preferred port groups, and are then forwarded to the Primary volume.
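The effect of these ALUA states on path selection can be illustrated with a small, purely conceptual Python sketch (not the ESXi PSP implementation): I/O is spread across Active/Preferred paths while any exist, and falls back to Active/Non-Preferred paths otherwise. The path names and state strings are assumptions made for the example.

import itertools

# Hypothetical path list with ALUA access states as the storage system
# would report them for a HyperSwap volume.
paths = [
    {"name": "vmhba1:C0:T0:L1", "alua_state": "active-preferred"},
    {"name": "vmhba1:C0:T1:L1", "alua_state": "active-preferred"},
    {"name": "vmhba2:C0:T0:L1", "alua_state": "active-non-preferred"},
    {"name": "vmhba2:C0:T1:L1", "alua_state": "active-non-preferred"},
]

def select_paths(paths):
    """Prefer Active/Preferred paths; fall back to Active/Non-Preferred."""
    preferred = [p for p in paths if p["alua_state"] == "active-preferred"]
    return preferred if preferred else [p for p in paths if p["alua_state"].startswith("active")]

# Cycle over the selected paths, as a Round Robin policy conceptually does.
rr = itertools.cycle(select_paths(paths))
for _ in range(4):
    print(next(rr)["name"])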

When the HyperSwap volume is activated, the Secondary volume is not synchronized, and read requests are redirected to the Primary volume until synchronization has completed. When the volumes are synchronized, the system that owns the Secondary volume serves read requests locally.

If the system that owns the Secondary volume is unable to perform I/O, the Secondary volume port group state changes to Unavailable. This usually happens due to a connectivity failure. The Primary volume remains active, therefore no automatic failover is needed. As soon as connectivity is restored, the volumes will be re-synchronized automatically.

If the system that owns the Primary volume is unable to perform I/O, the Primary volume port group state changes to Unavailable. As soon as the system that owns the Secondary volume receives the corresponding notification from the Quorum Witness, it performs a transparent failover, and the Secondary volume assumes the Primary role. When the system that was originally designated for the Primary volume is restored, recovery must be performed manually via the CLI or IBM Hyper-Scale Manager; this involves switching the roles of the peer systems and re-activating the HyperSwap relationship between them.

System level perspective
Since every system can contain a mix of multiple Primary and Secondary volumes and can have HyperSwap relationships with multiple systems, it is useful to consider topologies at the system level, with multiple HyperSwap volumes. Here are a few examples.

In a conventional Disaster Recovery topology (that is, where one system is located in a site designated by storage operations for disaster recovery), one storage system owns the volumes designated as Primary, and the other system owns the volumes designated as Secondary:


In a star-shaped Disaster Recovery topology, where one of the sites is designated by the storage operations as a Disaster Recovery site, a single storage system in the Disaster Recovery site is dedicated to simultaneously serving multiple other systems:


In a symmetrical system topology, both systems have volumes designated as Primary and Secondary, depending on the preferred location of the application:


Since every system can have HyperSwap relationships with up to ten other systems, and serve a mix of Primary and Secondary volumes, other topologies are possible as well.

Supported scenarios
The following scenarios are based on two HyperSwap solution configurations: Uniform host connectivity and Non-Uniform host connectivity.

In a uniform configuration, each host can access both the Primary and Secondary volumes. The uniform configuration is the best practice to protect a host from data access problems:


Use cases for a Uniform configuration:
Scenario 1: Using VMware vMotion or VMware Distributed Resource Scheduler (DRS) to migrate virtual machines between Data Center A and Data Center B
  • A9000/R behavior: The user can optionally switch the roles of the volumes/consistency group in order to change the role of the volumes/consistency group in Data Center B to Primary.
  • VMware vSphere behavior: I/O continues with the storage system in Data Center A.

Scenario 2: Failure of all ESXi hosts in Data Center A (power off)
  • A9000/R behavior: If the user issues the ha_switch_roles command, the I/O of the recovered virtual machines at Data Center B is served by the local storage system.
  • VMware vSphere behavior:
    • VMware HA automatically restarts the virtual machines on the available ESXi hosts in Data Center B.
    • There is no downtime if Fault Tolerance is configured on the virtual machines.

Scenario 3: Host partial path failure (some paths are still alive)
  • A9000/R behavior: No impact.
  • VMware vSphere behavior: No impact on virtual machines. ESXi I/O is redirected to any available active path via PSP (ALUA).

Scenario 4: Failure of all preferred paths on the host (local storage); only non-preferred paths are alive (remote storage)
  • A9000/R behavior: The user can optionally switch the roles of the local and remote HyperSwap volumes or consistency groups in order to improve I/O latency.
  • VMware vSphere behavior: ESXi I/O is redirected to non-preferred paths via PSP (ALUA). No impact on virtual machines.

Scenario 5: Failure of all paths on the host (APD) - no paths are alive
  • A9000/R behavior: No impact.
  • VMware vSphere behavior: Two options to recover the virtual machines:
    • ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts.
    • Enable the VMCP capability under the HA settings to handle the datastore APD situation and restart virtual machines on the other hosts.

Scenario 6: Data Center A A9000/R storage system fails
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in the HyperSwap relations.
    • Host I/O is redirected to Data Center B.
    • When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior:
    • Active paths to the Data Center A A9000/R are reported unavailable.
    • Active paths to the Data Center B A9000/R become preferred.
    • No disruption to virtual machines and/or ESXi I/O.
    • If the issue is not resolved, it is recommended to move the virtual machines to Data Center B.

Scenario 7: Data Center A failure (both ESXi hosts and A9000/R)
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in the HyperSwap relations.
    • When Data Center A is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior:
    • VMware High Availability restarts failed virtual machines on the available ESXi hosts at Data Center B.
    • There is no downtime if Fault Tolerance is configured on the failed virtual machines.

Scenario 8: The storage system that owns the Primary volume loses connectivity with the Quorum Witness and with the storage system that owns the Secondary volume
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their respective HyperSwap relations.
    • Host I/O is redirected to the Data Center B storage system.
    • Volumes initially designated as Primary stop serving I/O.
  • VMware vSphere behavior:
    • Active paths to HyperSwap volumes on Data Center A are reported unavailable.
    • Active paths to HyperSwap volumes on Data Center B become preferred.
    • No disruption to virtual machines and/or ESXi I/O.

Scenario 9: Data Center B failure (both ESXi hosts and A9000/R)
  • A9000/R behavior: HyperSwap failover:
    • Primary HyperSwap volumes/consistency groups on Data Center A are not affected.
    • When Data Center B is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior: No disruption to virtual machines running on Data Center A.

Scenario 10: Storage mirror link failure
  • A9000/R behavior:
    • Synchronization between Primary and Secondary HyperSwap volumes/consistency groups is broken.
    • Secondary HyperSwap volumes/consistency groups stop serving host I/O.
    • Primary HyperSwap volumes/consistency groups continue serving I/O.
  • VMware vSphere behavior:
    • No disruption to virtual machines and/or ESXi I/O.
    • Paths to Primary volumes/CGs remain Active/Preferred.
    • Paths to Secondary volumes/CGs become unavailable.

Scenario 11: Quorum Witness server failure
  • A9000/R behavior: Mirroring between Primary and Secondary volumes continues, and both Primary and Secondary keep serving host I/O.
  • VMware vSphere behavior:
    • No disruption to virtual machines.
    • An additional failure at this point will not trigger automatic failover and can result in loss of access.

In a non-uniform configuration, storage high availability relies on the server detecting the failure and failing the application over to a server that has access to active storage. A non-uniform configuration can be used when the host is part of a cluster, can fail over to another host in the cluster, and that other host is connected to the peer system. This configuration is less costly from a network perspective. However, it relies on host failover, which in most cases would not be necessary in a uniform configuration.

Test cases for a Non-Uniform configuration:
Scenario 1: Using VMware vMotion or VMware Distributed Resource Scheduler (DRS) to migrate virtual machines between Data Center A and Data Center B
  • A9000/R behavior: The user can optionally switch the roles of the volumes/consistency group in order to change the role of the volumes/consistency group in Data Center B to Primary.
  • VMware vSphere behavior: I/O continues with the storage system in Data Center A.

Scenario 2: Failure of all ESXi hosts in Data Center A (power off)
  • A9000/R behavior: If the user issues the ha_switch_roles command, the I/O of the recovered virtual machines at Data Center B is served by the local storage system.
  • VMware vSphere behavior:
    • VMware HA automatically restarts the virtual machines on the available ESXi hosts in Data Center B.
    • There is no downtime if Fault Tolerance is configured on the virtual machines.

Scenario 3: Host partial path failure (some paths are still alive)
  • A9000/R behavior: No change.
  • VMware vSphere behavior: No impact on virtual machines. ESXi I/O is redirected to any available active path via PSP (ALUA).

Scenario 4: Failure of all paths on the host (APD) - no paths are alive
  • A9000/R behavior: No change.
  • VMware vSphere behavior: Two options to recover the virtual machines:
    • ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts.
    • Enable the VMCP capability under the HA settings to handle the datastore APD situation and restart virtual machines on the other hosts.

Scenario 5: Data Center A A9000/R storage system fails
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in the HyperSwap relations.
    • When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior: Two options to recover the virtual machines:
    • ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts.
    • Enable the VMCP capability under the HA settings to handle the APD situation and restart virtual machines on the other hosts. Active paths to the Data Center B A9000/R become preferred.

Scenario 6: Data Center A failure (both ESXi hosts and A9000/R)
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in the HyperSwap relations.
    • When Data Center A is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior:
    • VMware High Availability restarts failed virtual machines on the available ESXi hosts at Data Center B.
    • There is no downtime if Fault Tolerance is configured on the failed virtual machines.

Scenario 7: The storage system that owns the Primary volume loses connectivity with the Quorum Witness and with the storage system that owns the Secondary volume
  • A9000/R behavior: HyperSwap failover:
    • Secondary HyperSwap volumes/consistency groups on Data Center B become Primary in their respective HyperSwap relations.
    • When the Data Center A A9000/R storage system is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior: Two options to recover the virtual machines:
    • ESXi hosts must be shut down manually for VMware High Availability to restart virtual machines on the other hosts.
    • Enable the VMCP capability under the HA settings to handle the APD situation and restart virtual machines on the other hosts. Active paths to the Data Center B A9000/R become preferred.

Scenario 8: Data Center B failure (both ESXi hosts and A9000/R)
  • A9000/R behavior: HyperSwap failover:
    • Primary HyperSwap volumes/consistency groups on Data Center A are not affected.
    • When Data Center B is recovered, manual recovery is required to restore the original configuration.
  • VMware vSphere behavior: No disruption to virtual machines running on Data Center A.

Scenario 9: Storage mirror link failure
  • A9000/R behavior:
    • Synchronization between Primary and Secondary HyperSwap volumes/consistency groups is broken.
    • Secondary HyperSwap volumes/consistency groups stop serving host I/O.
    • Primary HyperSwap volumes/consistency groups continue serving I/O.
  • VMware vSphere behavior: No disruption to virtual machines and/or ESXi I/O.

Scenario 10: Quorum Witness server failure
  • A9000/R behavior: Mirroring between Primary and Secondary volumes continues, and both Primary and Secondary keep serving host I/O.
  • VMware vSphere behavior:
    • No disruption to virtual machines.
    • An additional failure at this point will not trigger automatic failover and can result in loss of access.

Additional references
For information about the use of IBM FlashSystem A9000/R HyperSwap with VMware Site Recovery Manager, refer to:


Additional Information