Deploying TPCF on a VSAN Stretched Cluster
Article ID: 390047

Products

  • VMware vSAN
  • VMware Tanzu Application Service
  • VMware Tanzu Application Service for VMs

Issue/Introduction

A VSAN stretched cluster is a topology in which a single vSphere cluster spans two fault domains, providing storage resilience across a pair of sites.

For more general information about VSAN stretched clustering, see the configuration guide.

TPCF uses MySQL as its internal database. MySQL replication replicates the data between the nodes, but in a VSAN stretched cluster the underlying disks are also replicated to the opposite site. It is therefore not recommended to run an HA MySQL cluster spanned across separate sites in a VSAN stretched cluster, as the disks can end up at different points in time during a failure. Because MySQL cannot be assigned to an availability zone separate from the rest of the TPCF tile, the entire foundation must be set to a single availability zone. This KB describes how to pin the TPCF control plane to a single site/AZ while using an Isolation Segment to spread the Diego Cells and Gorouters across both sites/AZs, and what the implications are for DR.

Resolution

To enable stretched clustering, the following are required but are not covered in detail by this KB:

  • A third site to host the VSAN witness VM, which is used to prevent split-brain scenarios
  • Resilient network ingress/egress
  • A global server load balancer, such as Avi, configured for the Gorouters and Diego Brain, if site failure detection is needed

This process was tested on TPCF 10.0.x on vSphere 8u3 using VSAN ESA.

It is recommended that all of the DR tests listed below be carried out in the target environment, at scale.

Any additional data services running alongside TPCF that have a quorum mechanism should be pinned to AZ1 to ensure that they run within the same site.

Setup and Architecture

Because TPCF's AZ definitions do not allow the Diego Cells and Gorouters to be separated from the control plane, a topology called “Non-isolated Isolation Segments” can be used to separate the components: the TPCF tile is deployed without Diego Cells, and an Isolation Segment tile is then deployed, providing Diego Cells and Gorouters without configuring any actual isolation.

Settings in vSphere on a stretched cluster

  • Create 2 host groups, each containing the hosts from a single site (one way to do this from the CLI is sketched below)
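
For example, the two host groups can be created with the govc CLI instead of the vSphere UI. This is a minimal sketch; the cluster inventory path, group names and ESXi host names are illustrative assumptions:

    # Create one host group per site, each containing only that site's hosts
    govc cluster.group.create -cluster /dc1/host/stretched-cluster -name az1-hosts -host esxi-a1.example.com esxi-a2.example.com
    govc cluster.group.create -cluster /dc1/host/stretched-cluster -name az2-hosts -host esxi-b1.example.com esxi-b2.example.com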

Settings for the Bosh tile

  • Under “Director Config” de-select “Enable VM Resurrector Plugin” to prevent race conditions between Bosh and vSphere HA.
  • Under “Create Availability Zones” create 2 availability zones, each mapping to a single host group, and set the “VM-Host Affinity Rule” to “SHOULD”. This allows the VMs to move between sites if needed, and vSphere DRS will move them back automatically.
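
After “Apply Changes”, the vSphere CPI is expected to create a matching DRS VM group and “should” VM-Host affinity rule for each availability zone. One way to confirm the rules exist is with the govc CLI; the cluster inventory path below is an assumption:

    # List the DRS rules generated for the stretched cluster
    govc cluster.rule.ls -cluster /dc1/host/stretched-cluster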

Settings for the TPCF tile

  • Pin the foundation to the first availability zone on the “Assign AZs and Networks” tab, for both singleton and other jobs.
  • Under “Advanced Features” select “Do not deploy Diego cells”.
  • The Gorouters can be scaled down to 1 instance; they cannot be scaled to zero, and they will be unused.
  • The wildcard certificates can be shared with the Isolation Segment.
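
If you manage tile settings declaratively, the same AZ pinning can be captured with the om CLI rather than the Ops Manager UI. A minimal sketch, assuming the TPCF product name is cf; the exported YAML is edited by hand to pin the singleton and other jobs to the first AZ before being re-applied:

    # Export the staged TPCF configuration to a file
    om staged-config --product-name cf > cf-config.yml

    # Edit cf-config.yml to pin the AZs, then push the configuration back
    om configure-product --config cf-config.yml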

Settings for the Isolation Segment tile

  • On the “Assign AZs and Networks” tab, set the “other” jobs to run on both availability zones.
  • On the Isolation Segment tile, under “Compute and Networking Isolation”, keep the default settings, which add the Diego Cells to the default pool and grant the Gorouters access to everything.
  • Configure the Isolation Segment Gorouter with the certificates that you would use as standard on the main TPCF tile, for both the Apps and System domains.
  • Point your DNS servers to the Gorouters in the Isolation Segment.
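
Once DNS has been updated, resolution and router health can be sanity-checked from a jumpbox. A minimal sketch with a hypothetical domain and address; port 8080 is the Gorouter’s default health-check endpoint:

    # Confirm the apps wildcard now resolves to the Isolation Segment Gorouters
    dig +short test.apps.example.com

    # The Gorouter answers health checks on port 8080 by default
    curl http://10.0.1.20:8080/health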

DR Scenarios and Behaviors

Each scenario below describes how to recreate the failure state and the expected behaviour.

Primary site loss

  • Test method: Power off all ESXi hosts on the primary site without following standard shutdown procedures.
  • Test results: All system instances, Diego Cells and Gorouters fail over to the secondary site. The bootstrap errand is needed to recover the MySQL cluster. In testing, the control plane came up before the Diego Cells, meaning that after a full recovery the containers remained balanced across the Diego Cells. In-flight TPCF API requests may be lost or partially completed.

Secondary site loss, or loss of inter-site connectivity

  • Test method: Power off all ESXi hosts on the secondary site without following standard shutdown procedures.
  • Test results: All Diego Cells and Gorouters running on the secondary site are shut down and fail over to the primary site within approximately 10 minutes. TPCF will attempt to reschedule all containers on the surviving cells on the primary site before the secondary site VMs have moved, assuming there is available disk and memory. This creates an imbalance which needs to be remediated; see the workaround under “Additional Recovery Steps” below. No TPCF API requests are lost, as the primary site, where the control plane runs, is not affected.

Loss of connectivity to the witness

  • Test method: Disable the network interface of, or power off, the witness VM.
  • Test results: No interruption to service.

Witness loss combined with primary or secondary site loss

  • Test method: Disable the network interface of, or power off, the witness VM, and power off all hosts on one of the sites.
  • Test results: All VMs on both sites are shut down, as quorum is lost.

Either site put into maintenance mode

  • Test method: In vSphere, put all hosts in a single site into maintenance mode.
  • Test results: All VMs are vMotioned to the online site with no interruption to service.

Additional Recovery Steps

Primary site failure - control plane recovery

The MySQL cluster will not recover automatically if all nodes go down; during this time, the Diego Cells and Gorouters on the secondary site remain available and able to service traffic.

Workaround

  • Check that the MySQL VMs have booted and are not in an “unresponsive agent” state.
  • Run the bootstrap errand to regain quorum in the MySQL cluster and recover the control plane. After completion, the foundation will recover and the failed-over Diego Cells will come up. Both steps are sketched below.
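
A minimal sketch of both steps with the BOSH CLI, assuming the TPCF deployment name follows the usual cf-<guid> pattern:

    # Check that the MySQL VMs are running and their agents are responsive
    bosh -d cf-<deployment-guid> instances

    # Run the bootstrap errand to restore MySQL quorum and recover the control plane
    bosh -d cf-<deployment-guid> run-errand bootstrap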

Secondary site failure - container rebalance

When the secondary site fails, TPCF will attempt to reschedule all containers on the primary site before the Diego Cell VMs from the secondary site boot on the primary. This causes an imbalance of containers once the secondary site is restored, as TPCF will not attempt to rebalance after all Diego Cells recover. Note that TPCF will not allow over-commit of memory, only CPU.

Workaround

Run "bosh restart -d <isolation_segment_deployment_id> <isolated_diego_cell_instance_group> --no-converge", which will trigger the recreation of every VM and result in the workloads being rebalanced.

For large foundations, the "--max-in-flight=" option can be used to make the process more aggressive by restarting more Diego Cells in parallel, as in the sketch below.
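
For example, a sketch using the same placeholder names, restarting four Diego Cells at a time:

    # Restart the isolated Diego Cells in parallel batches to speed up the rebalance
    bosh restart -d <isolation_segment_deployment_id> <isolated_diego_cell_instance_group> --no-converge --max-in-flight=4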