High-Availability solution for VMware ESXi across two sites in a Metro environment with Fujitsu ETERNUS Storage Cluster V16.x

Article ID: 301228


Products

VMware vSphere ESXi

Issue/Introduction

This article describes the implementation of a High-Availability solution across two sites in a metro area using the combination of VMware HA/FT and the Fujitsu ETERNUS Storage Cluster feature in both FibreChannel and iSCSI SAN environments.
 
Note: The PVSP policy implies that the solution is not directly supported by VMware. For issues with this configuration, contact Fujitsu directly. See the Support Workflow on how partners can engage with VMware. It is the partner's responsibility to verify that the configuration functions with future vSphere major and minor releases as VMware does not guarantee that compatibility with future releases is maintained.
 
Disclaimer: The partner products referenced in this article are software developed and supported by a partner. Use of these products is also governed by the end user license agreements of the partner. You must obtain the application, support, and licensing for using these products from the partner.
 


Environment

VMware ESXi 3.5.x Embedded
VMware ESXi 3.5.x Installable
VMware ESXi 4.0.x Embedded
VMware ESXi 4.0.x Installable
VMware ESXi 4.1.x Embedded
VMware ESXi 4.1.x Installable
VMware vSphere ESXi 5.0
VMware vSphere ESXi 5.1
VMware vSphere ESXi 5.5
VMware vSphere ESXi 6.0
VMware vSphere ESXi 6.5

Resolution

Solution overview

Storage Cluster is the high-availability feature of the ETERNUS AF and ETERNUS DX S3/S4 storage arrays. Data is synchronously mirrored between two interlinked storage systems. If the primary (active) system fails, all primary host connections are switched instantly to the secondary (standby) system. This failover is transparent to both servers and applications and ensures uninterrupted operation. In addition, the failover can be executed in both directions and between different ETERNUS AF all-flash and ETERNUS DX models, thus supporting non-stop operations very efficiently.
 
This advanced feature helps fulfill service levels and delivers predictable operation for business-critical applications, particularly in virtualized server environments.
 
Storage Cluster in Virtualized Server Environments
 
Thanks to bidirectional mirroring in combination with VMware’s HA/FT functionalities, Storage Cluster helps overcome even complete site outages in virtualized multisite server configurations. It provides instantaneous, non-disruptive failover in the event of server or site failures, delivering protection from even the slightest lapse, disruption or data loss.
 
Storage Cluster configuration is based on remotely replicated Transparent Failover Volumes (TFOVs), which can be freely configured and paired. Thus, a single site can host both primary and secondary Transparent Failover (TFO) groups and TFOVs, connected via active (link-up) or passive (link-down) host ports, respectively.
 
 
Under normal conditions, VM1 and VM2 at site 1 and VM3 and VM4 at site 2 are connected to the ETERNUS AF/DX array located at the same site; this is the active (primary) site for those VMs. All data is synchronously replicated to the standby (secondary) site: in this example, ETERNUS B is the secondary array for VM1 and VM2, and ETERNUS A is the secondary array for VM3 and VM4.
 
The ports on both sites have the same WWPN identity. Because the port on the primary site is in “link-up” status and the port on the secondary site is in “link-down” status, all server I/O is processed by the primary storage.
 

Solution details

Primary and secondary storage
 
The primary and secondary ETERNUS AF/DX storage arrays should preferably be located in different fire compartments, or better still in different buildings or dispersed across a metropolitan area. Storage Cluster is set up using Transparent Failover Volumes (TFOVs), which are part of a special copy group, the TFO group. The layout of the TFOVs and TFO groups is identical on the primary and secondary storage, including the configuration settings for automated storage tiering, snapshots, and so on. TFOVs are synchronously replicated from the primary to the secondary array using the Remote Equivalent Copy (REC) feature of ETERNUS AF/DX.
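
The pairing rules above can be pictured with a small data-structure sketch. This is purely illustrative: the class and function names below are hypothetical and do not correspond to the ETERNUS SF API; the sketch only shows that a TFO group is paired as a whole and that the TFOV layout must be identical on both arrays.

    from dataclasses import dataclass, field

    @dataclass
    class TFOVolume:
        name: str
        size_gb: int

    @dataclass
    class TFOGroup:
        # A Transparent Failover group: the unit that fails over as a whole.
        name: str
        role: str                                    # "primary" or "secondary"
        volumes: list = field(default_factory=list)

    def build_rec_pairs(primary, secondary):
        """Pair each primary TFOV with its synchronously mirrored counterpart (REC pair)."""
        primary_layout = [(v.name, v.size_gb) for v in primary.volumes]
        secondary_layout = [(v.name, v.size_gb) for v in secondary.volumes]
        assert primary_layout == secondary_layout, "TFOV layout must be identical on both arrays"
        return list(zip(primary.volumes, secondary.volumes))

    # Example: one TFO group mirrored from ETERNUS A (primary) to ETERNUS B (secondary).
    vols = [TFOVolume("vmfs_datastore_01", 2048), TFOVolume("vmfs_datastore_02", 1024)]
    rec_pairs = build_rec_pairs(TFOGroup("TFOG_1", "primary", list(vols)),
                                TFOGroup("TFOG_1", "secondary", list(vols)))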
 
 
 
Channel adapter (CA) ports
 
Primary and secondary storage have paired CA ports. Paired means that the ports on both sites have the same identity regarding WWN/WWPN or IP address. Under normal conditions the CA port on the primary site is in “link up” status and the port on the secondary site is in “link down” status, so all server I/O is processed by the primary storage. The CA port states, as well as the mirror state, are controlled by the Storage Cluster feature.
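
A minimal sketch of this port pairing, using hypothetical names and an example WWPN (this is not the ETERNUS SF API): both ports of a pair present the same identity to the SAN, and only their link state determines which array receives the server I/O.

    from dataclasses import dataclass

    @dataclass
    class CAPort:
        array: str        # e.g. "ETERNUS_A" or "ETERNUS_B"
        wwpn: str         # both ports of a pair present the same WWPN to the fabric
        link_up: bool

    def fail_over_pair(active, standby):
        """Swap the link state of a paired CA port; the host keeps addressing a single WWPN."""
        assert active.wwpn == standby.wwpn, "paired ports must share the same identity"
        active.link_up, standby.link_up = False, True

    pair = (CAPort("ETERNUS_A", "50:00:00:e0:00:00:00:01", True),
            CAPort("ETERNUS_B", "50:00:00:e0:00:00:00:01", False))
    fail_over_pair(*pair)    # after the switch, only the secondary port is link-up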
 
 
Storage Cluster Controller
 
The Storage Cluster Controller is a server or virtual machine that runs an agent connecting to the ETERNUS SF management software. It monitors the health of the primary and secondary storage in order to detect outages of the active system.
The Storage Cluster Controller triggers the automatic failover only in this scenario; it is not involved in administrator-triggered manual failovers or in automated failovers caused by RAID failures.
 
 
ETERNUS SF Management Server
 
ETERNUS SF management is the prerequisite for setting up the Storage Cluster configuration with regard to TFO groups, TFOVs, copy groups and REC pairs. It also executes the failover and failback operations, either triggered by the Storage Cluster Controller in cases of automatic failover, or manually by an operator. It also executes the automated failover in cases of RAID failures on the primary array.
ETERNUS SF and the Storage Cluster Controller can be installed on the same physical or virtual server.
 
Failover Mechanism
Storage Cluster reroutes I/O access from one array to the other.
 
 
If an outage occurs, the failover sequence is executed as follows (a brief illustrative sketch follows the numbered steps):
  1. The server sends I/O requests to the primary storage.
  2. The primary CA port does not respond; the Storage Cluster Controller detects that the primary ETERNUS is unreachable and reports it to ETERNUS SF.
  3. The server retries the I/O after a preset time-out.
  4. ETERNUS SF suspends the remote mirroring (REC) session, and the replicated data becomes the actual business data.
  5. The CA port on the secondary array is activated (link up) with the same identity (WWN/WWPN or IP address) as the primary CA port.
  6. The server I/O is processed by the secondary storage before the retry time-out is exceeded. The application continues running without any restrictions.

    Note: Such an automatic failover is typically completed within less than three seconds in FibreChannel environments, which is sufficient for most applications to keep on running smoothly.
     
  7. If the primary site is unavailable entirely, VMware HA/FT will fail over to the secondary site without interrupting the business applications.
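
This is the sketch referenced before the list: a compact, purely illustrative summary of the step ordering. The object and function names are hypothetical; the real sequence is orchestrated by ETERNUS SF and the Storage Cluster Controller, not by the host, and the retry time-out value depends on the operating system and HBA settings.

    import time
    from dataclasses import dataclass

    RETRY_TIMEOUT_S = 30.0    # assumed host I/O retry window (OS/HBA dependent)

    @dataclass
    class Array:
        name: str
        reachable: bool
        ca_link_up: bool

    @dataclass
    class RECSession:
        suspended: bool = False

    def automatic_failover(primary, secondary, rec):
        """Illustrative ordering of steps 2, 4, 5 and 6 from the list above."""
        start = time.monotonic()
        if primary.reachable:           # step 2: act only when the primary is unreachable
            return
        rec.suspended = True            # step 4: suspend REC; the mirror becomes the live data
        primary.ca_link_up = False
        secondary.ca_link_up = True     # step 5: secondary port comes up with the same identity
        # step 6: the host's retried I/O must reach the secondary before its time-out expires
        assert time.monotonic() - start < RETRY_TIMEOUT_S

    automatic_failover(Array("ETERNUS_A", reachable=False, ca_link_up=True),
                       Array("ETERNUS_B", reachable=True, ca_link_up=False),
                       RECSession())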
 
Failover and Failback
 
Storage Cluster can handle different failover types and scenarios:
  • Automatic failover: Failover is triggered automatically when the primary storage becomes unreachable, a RAID group becomes unavailable or all the CA ports on the primary storage connected to an ESXi host are link down. This mode ensures business continuity in cases of unpredictable failures or a disaster at the primary site.
  • Manual failover: Failover is triggered manually from the ETERNUS SF user interface by stopping access on the primary storage and activating the secondary storage. This mode ensures business continuity when planned downtime is required on the primary site, e.g., for maintenance, disruptive upgrades or planned power shutdowns. It can also be used for general testing of the failover mechanism.
  • Force failover: Failover is triggered manually from the ETERNUS SF user interface by activating the secondary storage regardless of the status of the primary storage. This mode ensures business continuity in cases of emergency when the primary storage is unreachable, and for any reason the automatic failover cannot be executed.
  • Auto failback: Failback from the secondary site to the primary site is triggered automatically under these conditions: the Storage Cluster Controller confirms that all systems are operative, the REC session is established, and the business data and mirror data are in a consistent state (see the sketch after this list).
  • Manual failback: Failback from the secondary site to the primary site is triggered manually via the ETERNUS SF user interface. This mode restores normal operation manually. The conditions for auto failback apply here as well.
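
The failback pre-conditions mentioned in the list above can be summarized with a tiny, hypothetical check; the real evaluation is performed by ETERNUS SF and the Storage Cluster Controller.

    def failback_allowed(all_systems_operative, rec_established, mirror_equivalent):
        """Failback (automatic or manual) may only proceed when every condition holds."""
        return all_systems_operative and rec_established and mirror_equivalent

    # Example: mirroring has restarted but has not yet reached the equivalent state,
    # so operation stays on the secondary array for now.
    print(failback_allowed(True, True, False))    # False
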
Minimum requirements and limitations
 
The minimum system requirements and limitations for implementing a Fujitsu ETERNUS Storage Cluster with VMware HA/FT solution are as follows:
  • A combination of two freely selectable storage systems within the family: ETERNUS DX100 S3, DX100 S4, DX200 S3, DX200 S4, DX500 S3, DX600 S3, DX8700 S3, DX8900 S3, AF250, AF650.
  • Fibre Channel (FC)/iSCSI SAN to connect the ESXi hosts to the arrays and for the remote mirror link.
  • ETERNUS SF Storage Cruiser Standard License V16.1 or later (one license per array).
  • ETERNUS SF Storage Cruiser Storage Cluster Option V16.1 or later (one license per array).
  • ETERNUS SF AdvancedCopy Manager Remote Copy V16.1 or later (one license per array).
  • Front end connection failures occurring at the link between the ESXi host and the FC/iSCSI SAN switches are not detected by ETERNUS Storage Cluster.
  • The roundtrip time between the two storage arrays must not exceed 10 milliseconds. Due to this physical restriction on synchronous mirroring, Storage Cluster can be deployed in building, campus and metro environments.
  • In environments where iSCSI configurations are used for the host connection and the copy path, switching storage systems for a failover or failback requires approximately 30 to 120 seconds, which is more time than for Fibre Channel (FC) configurations. Therefore, unlike FC configurations, a failover might not be performed transparently and applications may become aware of the operation (see the sketch after this list).
  • It is recommended to set the active path on the same controller module (CM) on both arrays to avoid possible performance degradation after failover when used in conjunction with ESXi.
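
As referenced in the list above, the distance and transport constraints can be sanity-checked before deployment. The sketch below is only an illustration under stated assumptions: the 10 millisecond and 30 to 120 second figures come from this article, the roughly three second FC figure comes from the note in the failover sequence, and the application I/O time-out is a value you would have to supply for your own environment.

    MAX_RTT_MS = 10.0                          # synchronous REC limit between the two arrays
    EXPECTED_FAILOVER_S = {"fc": 3.0,          # typical automatic failover time for FC
                           "iscsi": 120.0}     # iSCSI switchover may take 30 to 120 seconds

    def check_site_link(rtt_ms, transport, app_io_timeout_s):
        """Return warnings for a planned Storage Cluster deployment (illustrative only)."""
        warnings = []
        if rtt_ms > MAX_RTT_MS:
            warnings.append(f"RTT of {rtt_ms} ms exceeds the {MAX_RTT_MS} ms limit for synchronous mirroring")
        if EXPECTED_FAILOVER_S[transport] >= app_io_timeout_s:
            warnings.append("switchover may not be transparent: expected failover time "
                            "is not below the application's I/O time-out")
        return warnings

    # Example: a metro iSCSI setup where applications time out after 60 seconds.
    print(check_site_link(rtt_ms=4.2, transport="iscsi", app_io_timeout_s=60.0))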
 
Verification method
 
The following failure scenarios were tested to validate Fujitsu ETERNUS SF Storage Cluster with VMware HA/FT. For each scenario, the ETERNUS behavior and the host-side impact (observed VMware HA/FT behavior) are listed.
Scenario: Primary site storage-side single path failure
ETERNUS behavior: The primary site ETERNUS continues to operate using an alternate path to the same host.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary site host-side single path failure
ETERNUS behavior: The primary site ETERNUS continues to operate using an alternate path to the same host.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary site storage manual failover
ETERNUS behavior: Upon the manual failover command, the primary site ETERNUS becomes standby and the secondary site ETERNUS becomes active. Once the primary site storage is restored, equivalent copy restarts between both systems to get ready for failback.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary site storage automatic failover
ETERNUS behavior: The primary site ETERNUS automatically becomes standby and the secondary site ETERNUS automatically becomes active. Once the primary site storage is restored, equivalent copy restarts between both systems to get ready for failback.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary site storage manual failover during a primary site storage-side single path failure, followed by manual failback after the primary site storage is restored
ETERNUS behavior: Upon the manual failover command, the primary site ETERNUS becomes standby and the secondary site ETERNUS becomes active. Once the primary site storage is restored, equivalent copy restarts between both systems to get ready for failback. After the equivalent status is reached, the manual failback command restores the primary site as active.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary site storage manual failover during a primary site host-side single path failure, followed by manual failback after the primary site storage is restored
ETERNUS behavior: Upon the manual failover command, the primary site ETERNUS becomes standby and the secondary site ETERNUS becomes active. Once the primary site storage is restored, equivalent copy restarts between both systems to get ready for failback. After the equivalent status is reached, the manual failback command restores the primary site as active.
Host-side impact / observed VMware HA/FT behavior: No impact.

Scenario: Primary-side storage all-path failure after HA cluster forced reboot
ETERNUS behavior: The primary site ETERNUS automatically becomes standby and the secondary site ETERNUS automatically becomes active. Once the primary storage is restored, equivalent copy restarts between both systems to get ready for failback.
Host-side impact / observed VMware HA/FT behavior: No impact.


Additional Information

Simplified Chinese: High-Availability solution for VMware ESXi across two sites in a Metro environment with Fujitsu ETERNUS Storage Cluster V16.x