A VSAN stretched cluster is a topology in which a single vSphere cluster spans two fault domains (sites), providing storage resilience across a pair of sites.
For more general information about VSAN Stretched clustering, see the configuration guide.
TPCF uses MySQL as its internal database. MySQL replication replicates the data between the nodes, but with VSAN stretched clustering the underlying disks are also replicated to the opposite site. As a result, it is not recommended to run an HA MySQL cluster spanned across separate sites in a VSAN stretched cluster, as this will cause the disks to be at different points in time during a failure. Because MySQL cannot be assigned to an availability zone separate from the rest of the TPCF tile, the entire foundation must be set to a single availability zone. This KB describes how to pin the TPCF control plane to a single site/AZ while using an Isolation Segment to allow the Diego Cells and Gorouters to be spread across both sites/AZs, and what the implications are for DR.
To enable stretched clustering the following are needed but not covered in detail by this KB:
This process was tested on TPCF 10.0.x on vSphere 8u3 using VSAN ESA.
It is recommended that all DR tests listed below are carried out in the target environment at scale.
Any additional data services running alongside TPCF that have a quorum mechanism should be pinned to AZ1 to ensure that they run within the same site.
Because TPCF does not allow for the separation of the Diego Cells and Gorouters from the control plane when it comes to AZ definitions, a topology called “Non-isolated Isolation Segments” can be used to separate out the components. This involves deploying the TPCF tile without Diego Cells and then deploying an Isolation Segment, which deploys Diego Cells and Gorouters without setting up any isolation.
The table below explains how to recreate each failure state and the expected behaviour.
| Scenario | Test Method | Test Results |
| --- | --- | --- |
| Primary site loss | Power off all ESXi hosts on the primary site without following standard shutdown procedures. | All system instances, Diego Cells and Gorouters fail over to the secondary site. The bootstrap errand is needed to recover the MySQL cluster. We observed that the control plane came up before the Diego Cells, so after a full recovery the containers remain balanced across Diego Cells. In-flight TPCF API requests are lost or only partially completed. |
| Secondary site loss or loss of inter-site connectivity | Power off all ESXi hosts on the secondary site without following standard shutdown procedures. | All Diego Cells and Gorouters running on the secondary site are shut down and fail over to the primary site within ~10 minutes. TPCF attempts to reschedule all containers on the surviving cells on the primary site before the secondary site VMs have moved, assuming there is available disk and memory. This creates an imbalance which needs to be remediated. See workaround below. No TPCF API requests are lost, as the primary site, where the control plane is running, is not affected. |
| Loss of connectivity to the Witness | Disable the network interface or power off the Witness VM. | No interruption to service. |
| Witness loss AND primary or secondary site loss | Disable the network interface or power off the Witness VM AND power off one of the sites. | All VMs on both sites are shut down as quorum is lost. |
| Either site put into maintenance mode | In vSphere, put all hosts in a single site into maintenance mode. | All VMs are vMotioned to the online site with no interruption to service. |
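The outcomes in the table follow a two-out-of-three quorum pattern across the primary site, the secondary site and the Witness. The sketch below is a deliberately simplified model of that pattern, assuming one vote per fault domain; real VSAN voting is tracked per object and is more detailed. The function names `storage_available` and `describe` are illustrative only and are not part of any VMware tooling.

```shell
#!/bin/sh
# Simplified model (assumption): primary, secondary and witness each hold one vote,
# and storage stays available while at least two of the three are reachable.
# Arguments: primary secondary witness, each 1 (reachable) or 0 (lost).
storage_available() {
  votes=$(( $1 + $2 + $3 ))
  [ "$votes" -ge 2 ]
}

# Print a human-readable outcome for a given combination.
describe() {
  if storage_available "$1" "$2" "$3"; then
    echo "available"
  else
    echo "quorum lost"
  fi
}

describe 0 1 1   # primary site loss
describe 1 0 1   # secondary site loss
describe 1 1 0   # witness loss
describe 1 0 0   # witness loss AND secondary site loss
```

Only the last combination loses quorum, which matches the row where all VMs on both sites are shut down.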
The MySQL cluster will not recover automatically if all nodes go down, but during this time the secondary site Diego cells and Gorouters will be available and able to service traffic.
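As a rough illustration of the manual MySQL recovery, the sketch below prints the BOSH commands an operator might run rather than executing them. It assumes the TPCF deployment name is supplied in `CF_DEPLOYMENT` (a placeholder introduced here) and that the errand is named `bootstrap`; confirm the errand name with the `errands` command before running anything.

```shell
#!/bin/sh
# CF_DEPLOYMENT is a placeholder; find the real name with `bosh deployments`.
CF_DEPLOYMENT="${CF_DEPLOYMENT:-<tpcf_deployment_id>}"

# Print (not run) the recovery steps so they can be reviewed first.
mysql_recovery_steps() {
  # 1. Check which instances are running or failing.
  printf '%s\n' "bosh -d $CF_DEPLOYMENT instances"
  # 2. List the errands to confirm the bootstrap errand name (assumed: bootstrap).
  printf '%s\n' "bosh -d $CF_DEPLOYMENT errands"
  # 3. Run the bootstrap errand to recover the MySQL cluster.
  printf '%s\n' "bosh -d $CF_DEPLOYMENT run-errand bootstrap"
}

mysql_recovery_steps
```

Printing the commands keeps the sketch safe to run anywhere; copy the individual lines into a shell that is targeted at the correct BOSH Director once the names have been verified.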
When the secondary site fails, TPCF attempts to reschedule all containers on the primary site before the Diego Cell VMs from the secondary site boot on the primary. This causes an imbalance of containers once the secondary site is restored, as TPCF does not attempt to rebalance after all Diego Cells recover. Note that TPCF does not allow over-commit of memory, only CPU.
Run “bosh restart -d <isolation_segment_deployment_id> <isolated_diego_cell_instance_group> --no-converge”, which restarts every Diego Cell in the instance group and results in the workloads being rebalanced.
For large foundations, the “--max-in-flight=” option can be used to speed up the process by restarting more Diego Cells in parallel.
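The rebalancing workaround can be sketched as a small dry-run helper that prints the restart command with or without a parallelism override. `DEPLOYMENT`, `INSTANCE_GROUP` and the value `4` are placeholders chosen for illustration. Whether a given BOSH CLI version accepts “--max-in-flight” together with “--no-converge” is not confirmed here, so check `bosh restart --help` before relying on the combination.

```shell
#!/bin/sh
# Placeholders: find the real names with `bosh deployments`
# and `bosh -d <deployment> instances`.
DEPLOYMENT="${DEPLOYMENT:-<isolation_segment_deployment_id>}"
INSTANCE_GROUP="${INSTANCE_GROUP:-<isolated_diego_cell_instance_group>}"

# Print the restart command; pass a number as $1 to add --max-in-flight.
restart_cmd() {
  cmd="bosh -d $DEPLOYMENT restart $INSTANCE_GROUP --no-converge"
  if [ -n "${1:-}" ]; then
    cmd="$cmd --max-in-flight=$1"
  fi
  printf '%s\n' "$cmd"
}

restart_cmd      # default parallelism from the deployment manifest
restart_cmd 4    # hypothetical value: restart four Diego Cells at a time
```

A higher value shortens the rebalance but removes more Diego Cell capacity at once, so it is worth sizing it against the spare memory on the remaining cells.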