Implementing vSphere Metro Storage Cluster using Dell Storage Live Volume

Products

VMware vSphere ESXi

Issue/Introduction

This article provides information about designing and deploying a vSphere Metro Storage Cluster (vMSC) with Dell EMC™ SC Series Live Volume.

Environment

VMware vSphere ESXi 6.7
VMware vSphere ESXi 6.0
VMware vSphere ESXi 7.0.0
VMware vSphere ESXi 6.5
VMware vSphere ESXi 5.5

Resolution

What is vMSC?

vSphere Metro Storage Cluster is a compute and storage virtualization solution certified by VMware. A vMSC architecture typically involves stretching a vSphere high availability cluster, along with its networking and underlying storage, across a supported distance to provide the highest achievable levels of workload availability.

What is Live Volume?

Live Volume is an innovative feature designed to deliver dynamic business continuity by keeping applications and data highly available during planned or unplanned downtime. Live Volume enhances SC Series array-based replication by presenting the same volume identifier from each array which appears as the same volume to the hosts. This makes the volume accessible from both the source array and destination array.

What is a Tiebreaker?

The tiebreaker is a service located at a site physically independent of SC Series arrays participating in a Metro Cluster. The tiebreaker service is an essential requirement for determining quorum during an unplanned event and for preventing split-brain conditions in the event a network or fabric partition occurs. The tiebreaker service is bundled with Dell™ Storage Manager (standard or Remote Data Collector) which can be installed on a physical Microsoft® Windows® server or a Windows virtual machine. Dell Storage Manager is also available as a downloadable Linux® virtual appliance.
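The role of a tiebreaker in avoiding split-brain can be illustrated with a generic quorum sketch. This is a simplified model for illustration only, not the SC Series or Dell Storage Manager implementation: the function name, the array labels "A" and "B", and the rule that an array needs the tiebreaker's vote during a partition are all assumptions.

```python
from typing import Optional


def elect_primary(link_up: bool,
                  a_reaches_tiebreaker: bool,
                  b_reaches_tiebreaker: bool,
                  current_primary: str = "A") -> Optional[str]:
    """Return which array ("A" or "B") may serve the primary volume.

    Hypothetical model: while the inter-array link is up there is no
    partition, so the current primary keeps its role. During a partition,
    only an array that can still reach the tiebreaker may hold the primary
    role, which guarantees that at most one array serves writes.
    """
    if current_primary not in ("A", "B"):
        raise ValueError("current_primary must be 'A' or 'B'")
    if link_up:
        return current_primary

    reachable = {"A": a_reaches_tiebreaker, "B": b_reaches_tiebreaker}
    other = "B" if current_primary == "A" else "A"

    if reachable[current_primary]:
        return current_primary      # primary keeps quorum
    if reachable[other]:
        return other                # failover to the surviving array
    return None                     # assumed: no array can prove quorum
```

In this model a partition never yields two primaries, because the function returns at most one array name for any combination of inputs.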

Configuration Requirements

The vSphere Metro Storage Cluster design and implementation with Live Volume must meet the following requirements:

Dell Storage Center OS (SCOS) 6.7 or newer with both SC Series arrays having the same OS/firmware version
Dell Storage Manager (all versions), or Dell Enterprise Manager 2015 R2 or newer, with tiebreaker service located at a site physically independent of SC Series arrays
VMware vSphere 5.5 or newer (vSphere 7 support added as of DSM 2020 R1)
Dell EMC SC Series VMware vSphere Best Practices followed and configured
Uniform or non-uniform storage presentation of volumes to vSphere hosts
Fixed or Round Robin path selection policy (PSP)
Live Volume with automatic failover is supported with VMFS datastores or Raw Device Mappings (RDMs) only; virtual mode RDMs are supported beginning with SCOS 6.7; physical mode RDMs are supported beginning with SCOS 7.1
Live Volumes with Failover Automatically enabled
Live Volumes with ALUA reporting of non-optimized paths recommended
Fibre Channel or iSCSI synchronous high availability replication between SC Series arrays
Maximum latency for synchronous high availability replication should not exceed 10ms round trip time (RTT)
Maximum supported latency of vSphere management network 10ms RTT
Maximum supported latency from each SC Series array to tiebreaker 200ms RTT
Redundant vMotion network supporting a minimum throughput of 250Mbps
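
The numeric limits in the list above lend themselves to a quick pre-deployment check. The following is a minimal sketch that uses only the thresholds stated in this article; the class name, field names, and the way measurements are gathered are hypothetical, and it does not query any vSphere or Dell Storage Manager API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetroClusterPlan:
    """Hypothetical container for values measured or planned by the admin."""
    scos_site_a: Tuple[int, int]   # SCOS version as (major, minor), e.g. (7, 1)
    scos_site_b: Tuple[int, int]
    psp: str                       # "Fixed" or "Round Robin"
    replication_rtt_ms: float      # array-to-array round trip time
    mgmt_rtt_ms: float             # vSphere management network round trip time
    tiebreaker_rtt_ms: float       # worst RTT from either array to the tiebreaker
    vmotion_mbps: float            # vMotion network throughput


def check_plan(plan: MetroClusterPlan) -> List[str]:
    """Return a list of requirement violations (empty if none are found)."""
    problems = []
    if plan.scos_site_a != plan.scos_site_b:
        problems.append("SC Series arrays must run the same SCOS version")
    if min(plan.scos_site_a, plan.scos_site_b) < (6, 7):
        problems.append("SCOS 6.7 or newer is required")
    if plan.psp not in ("Fixed", "Round Robin"):
        problems.append("PSP must be Fixed or Round Robin")
    if plan.replication_rtt_ms > 10:
        problems.append("replication latency exceeds 10ms RTT")
    if plan.mgmt_rtt_ms > 10:
        problems.append("management network latency exceeds 10ms RTT")
    if plan.tiebreaker_rtt_ms > 200:
        problems.append("tiebreaker latency exceeds 200ms RTT")
    if plan.vmotion_mbps < 250:
        problems.append("vMotion throughput is below 250Mbps")
    return problems


if __name__ == "__main__":
    plan = MetroClusterPlan((7, 1), (7, 1), "Round Robin", 4.0, 6.0, 80.0, 1000.0)
    print(check_plan(plan) or "all checked requirements met")
```

The check covers only the measurable items in the list; requirements such as following the best practices guide or enabling automatic failover still need to be verified manually.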

Solution Overview

A VMware-certified vSphere Metro Storage Cluster solution couples vSphere compute and Dell EMC storage virtualization to meet high-availability goals with infrastructure stretched across a given distance. An HA cluster is created with vSphere cluster nodes and SC Series arrays deployed at each site. The Dell Storage Manager Data Collector with tiebreaker service is deployed in a third location. Volumes are replicated through Fibre Channel or iSCSI between sites in Synchronous High Availability mode. Live Volume, automatic failover, and automatic recovery are configured on a per-volume basis. The volumes are presented to the vSphere cluster nodes and consumed as VMFS datastores or raw device mappings (RDMs). When an unplanned outage impacts a datastore or RDM configured with Live Volume automatic failover, SC Series storage and the tiebreaker service will provide high availability on the surviving SC Series array. Virtual mode RDMs are supported beginning with SCOS 6.7, and physical mode RDMs are supported beginning with SCOS 7.1. Live Volume supports both uniform and non-uniform storage presentation to the vSphere cluster nodes.

Uniform storage presentation depicts a design whereby both the primary and secondary Live Volumes are presented paths to all cluster nodes in both the local and remote sites. This is typical with local campus fabrics or sites stretched across shorter distances. The Round Robin or Fixed PSP may be configured for each volume presented to each local and remote host. The Live Volume ALUA feature, in conjunction with Round Robin, ensures front-end I/O traverses optimal paths to the primary Live Volume while available. Depending on the location of the primary Live Volume, the location of the workload, and the PSP, the I/O path from vSphere host HBA ports to SC Series array front-end ports is shorter if the I/O remains local within the site. Conversely, an I/O path will be longer if it traverses sites. An I/O path will be most efficient when the workload and the primary Live Volume reside within the same site. Complete path or storage failure within a site will allow virtual machines to continue operating through paths to the remote site, eliminating the need for vSphere HA virtual machine restarts. These are just a few design factors to consider when choosing the storage presentation and PSP.

Non-uniform storage presentation depicts a design whereby the primary or secondary Live Volumes are presented paths to the cluster nodes located within their respective local site only. This would be typical with sites stretched across longer distances. The Round Robin or Fixed PSP may be configured for each volume presented to each local host. In this storage presentation model, the I/O path from vSphere host HBA ports to SC Series array front-end ports remains local for a primary Live Volume. I/O sent to the secondary Live Volume starts local but will be proxied to the primary Live Volume over the replication link. Complete path or storage failure within a site would constitute an all paths down (APD) condition and prevent virtual machines from continuing to operate within that site. In this case, vSphere HA is configured to restart impacted virtual machines at the remote site. These are just a few design factors to keep in mind when choosing the storage presentation and PSP.
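
The two presentation models described above can be summarized in a small decision sketch. It is a simplified illustration of the behavior stated in this article, assuming ALUA with Round Robin in the uniform case; the function names and return strings are hypothetical.

```python
def io_path(presentation: str, host_site: str, primary_site: str) -> str:
    """Describe how host I/O reaches the primary Live Volume."""
    if presentation not in ("uniform", "non-uniform"):
        raise ValueError("presentation must be 'uniform' or 'non-uniform'")
    if host_site == primary_site:
        # Workload and primary Live Volume share a site: the most efficient path.
        return "local"
    if presentation == "uniform":
        # Host has paths to both arrays; optimal paths lead to the remote primary.
        return "cross-site"
    # Non-uniform: host only sees the local (secondary) array, which
    # forwards I/O to the primary over the replication link.
    return "proxied over replication link"


def on_local_storage_failure(presentation: str) -> str:
    """Describe the effect of a complete path or storage failure in a site."""
    if presentation == "uniform":
        return "VMs continue running through paths to the remote site"
    if presentation == "non-uniform":
        return "APD condition; vSphere HA restarts VMs at the remote site"
    raise ValueError("presentation must be 'uniform' or 'non-uniform'")
```

For example, `io_path("non-uniform", "site-b", "site-a")` describes a host at the secondary site whose I/O is proxied to the primary Live Volume.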

Regardless of front-end storage presentation and PSP, Live Volume depends on synchronous replication between sites, and the maximum round-trip latency between SC Series arrays should not exceed 10ms. Because replication is synchronous, higher round-trip latency between sites will degrade virtual machine performance. Adequate bandwidth, throughput, and latency must be available both locally and between sites.

The following table outlines tested design and component-failure scenarios with Live Volume automatic failover enabled with vSphere HA.

Live Volume and vSphere HA Scenarios

Event Scenario	Live Volume Behavior	vSphere HA Behavior
Uniform: Complete site outage takes down primary Live Volumes	Primary Live Volumes automatically recovered at remote site	vSphere HA restarts impacted VMs at remote site
Uniform: Complete site outage takes down secondary Live Volumes	Primary Live Volumes remain available at remote site	vSphere HA restarts impacted VMs at remote site
Non-uniform: Complete site outage takes down primary Live Volumes	Primary Live Volumes automatically recovered at remote site	vSphere HA restarts impacted VMs at remote site
Non-uniform: Complete site outage takes down secondary Live Volumes	Primary Live Volumes remain available at remote site	vSphere HA restarts impacted VMs at remote site
Uniform: SC Series array controller-pair outage takes down primary Live Volumes	Primary Live Volumes automatically recovered at remote site	None (VMs remain running at both sites)
Uniform: SC Series array controller-pair outage takes down secondary Live Volumes	Primary Live Volumes remain available at remote site	None (VMs remain running at both sites)
Non-uniform: SC Series array controller-pair outage takes down primary Live Volumes	Primary Live Volumes automatically recovered at remote site	vSphere HA restarts impacted VMs at remote site
Non-uniform: SC Series array controller-pair outage takes down secondary Live Volumes	Primary Live Volumes remain available at remote site	vSphere HA restarts impacted VMs at remote site
Uniform: SC Series array back-end outage (e.g., drive enclosure or multiple drive loss) takes down primary Live Volumes	Primary Live Volumes automatically recovered at remote site; SC Series array continues to provide Live Volume access from both arrays	None (VMs remain running at both sites)
Uniform: SC Series array back-end outage (e.g., drive enclosure or multiple drive loss) takes down secondary Live Volumes	Primary Live Volumes remain available at remote site; SC Series array continues to provide Live Volume access from both arrays	None (VMs remain running at both sites)
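
For runbooks or monitoring scripts, the table above can be encoded as a lookup. The sketch below reproduces only the ten tested scenarios listed in the table; the key names and function name are hypothetical, and scenarios not in the table deliberately raise an error rather than guess an outcome.

```python
RESTART = "vSphere HA restarts impacted VMs at remote site"
NO_ACTION = "None (VMs remain running at both sites)"
RECOVERED = "Primary Live Volumes automatically recovered at remote site"
AVAILABLE = "Primary Live Volumes remain available at remote site"

# Key: (presentation, outage type, role of the Live Volumes taken down)
# Value: (Live Volume behavior, vSphere HA behavior)
SCENARIOS = {
    ("uniform", "site", "primary"): (RECOVERED, RESTART),
    ("uniform", "site", "secondary"): (AVAILABLE, RESTART),
    ("non-uniform", "site", "primary"): (RECOVERED, RESTART),
    ("non-uniform", "site", "secondary"): (AVAILABLE, RESTART),
    ("uniform", "controller-pair", "primary"): (RECOVERED, NO_ACTION),
    ("uniform", "controller-pair", "secondary"): (AVAILABLE, NO_ACTION),
    ("non-uniform", "controller-pair", "primary"): (RECOVERED, RESTART),
    ("non-uniform", "controller-pair", "secondary"): (AVAILABLE, RESTART),
    ("uniform", "back-end", "primary"): (RECOVERED, NO_ACTION),
    ("uniform", "back-end", "secondary"): (AVAILABLE, NO_ACTION),
}


def expected_behavior(presentation: str, outage: str, role: str):
    """Look up the tested outcome for a scenario, or raise KeyError."""
    key = (presentation.lower(), outage.lower(), role.lower())
    if key not in SCENARIOS:
        raise KeyError(f"scenario not covered by the table: {key}")
    return SCENARIOS[key]


if __name__ == "__main__":
    live_volume, vsphere_ha = expected_behavior("Uniform", "controller-pair", "primary")
    print(live_volume)
    print(vsphere_ha)
```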

Additional Information

For more information, see Dell TechCenter.