HCX - Bulk Migration & Replication Assisted vMotion (RAV) scalability guide

Article ID: 321604


Products

VMware HCX

Issue/Introduction

This document describes the functional capacity for migrations using vSphere Replication (vSR) Bulk and Replication Assisted vMotion (RAV) in HCX.
The supported scale numbers are referenced per HCX Manager, irrespective of the number of Site Pairings or Service Mesh/IX Appliances deployed.
A Configuration Guide is provided within this document to increase the scale of concurrent Bulk/RAV migrations per HCX Manager beyond the default value if desired.

Considerations for concurrent Migration

There are several factors, at both the source and target HCX Managers, that can limit the number of concurrent migrations performed using Bulk & RAV (initial/delta sync):
  • Data storage
    • IOPS capacity
    • Shared vs. dedicated
  • Host resources
    • Overall ESXi host resources for all services
    • CPU & MEM reservations for the IX appliance VM
    • pNIC/VMNIC capacity and shared load
    • Dedicated vmk interfaces for different services like mgmt/vMotion/vSR.
  • Network Infrastructure throughout the entire data path
    • Data Center local network
    • Service Provider network infrastructure between source/target sites
    • Bandwidth availability
    • Latency and path reliability (packet loss)
      • vSphere replication (vSR) performance drops exponentially with higher packet loss and/or higher latency.
      • There is a built-in tolerance for high latency in vSphere replication but throughput will be reduced significantly.
Note: The HCX Transport Analytics functionality can be used to measure network infrastructure throughput during the migration planning phase. Refer to the Broadcom HCX User Guide.
  • Workload VM conditions
    • Number of disks
    • Total size and size per disk
    • Active services/applications
    • Data churning/disk changes

Default (Baseline) HCX Manager Resource Allocation:
 
vCPU   RAM (GB)   Disk Size (GB)
4      12         64

The supported numbers for concurrent Bulk/RAV migrations per Baseline HCX Manager deployments are:
  • 300 concurrent migrations per Manager
  • 200 concurrent migrations per Service Mesh/IX Appliance.
  • 1Gbps max per migration workflow
  • 1.6Gbps max per IX appliance (any number of concurrent migration workflows)

Resolution

 
The following Configuration Guide to increase the scale of concurrent Bulk/RAV migrations per HCX Manager is split into three sections depending on the HCX software version installed:
 
Section 1) New Procedure - HCX version 4.10 or newer
Section 2) Legacy Procedure - HCX version 4.7 - 4.9
Section 3) Recommendations for operating concurrent migrations at scale (all HCX versions 4.7 or newer)
 
It is highly recommended to use the new procedure introduced with HCX version 4.10 or newer. The benefits of the new procedure are:
  • Allows up to 1000 concurrent Bulk/RAV Migrations per HCX Manager. This is an increase from the 600 scale-up value in HCX versions 4.7 - 4.9 and the 300 default value in HCX versions prior to 4.7
  • Configurable scale settings on HCX Managers (default, medium, large)
  • Scale related configuration changes are persisted after an HCX Manager upgrade
  • A script is available to automate the required configuration change on the HCX Manager
  • Increased /common disk partition space on the HCX Manager

The configuration steps are executed on each HCX Manager, and the supported scale numbers are referenced per HCX Manager, irrespective of the number of Site Pairings or Service Mesh/IX Appliances deployed.

Section 1) New Procedure - HCX version 4.10 or newer

HCX 4.10 or newer introduces automation via an executable script that allows a scaled concurrent Bulk/RAV migration size setting of default, medium, or large on each HCX Manager. These form factors include pre-defined settings for disk space, app-engine memory allocation, and the number of threads for the different migration scales executed on each HCX Manager.

 

Scale Form Factor   vCPU Count   Memory (GB)   Storage (GB)   Concurrent Bulk/RAV Migrations per HCX Manager
Default             4            12            64             300
Medium              8            24            120            600
Large               16           48            300            1000
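As a rough planning aid, the sizing table above can be expressed as a small helper that picks the smallest form factor supporting a desired concurrency. This is a hypothetical sketch; `pick_form_factor` is not part of HCX:

```shell
#!/bin/sh
# Hypothetical helper: map a desired number of concurrent Bulk/RAV
# migrations to the smallest HCX Manager scale form factor that supports it.
pick_form_factor() {
  n="$1"
  if [ "$n" -le 300 ]; then
    echo "default"    # 4 vCPU / 12 GB / 64 GB
  elif [ "$n" -le 600 ]; then
    echo "medium"     # 8 vCPU / 24 GB / 120 GB
  elif [ "$n" -le 1000 ]; then
    echo "large"      # 16 vCPU / 48 GB / 300 GB
  else
    echo "unsupported: maximum is 1000 per HCX Manager" >&2
    return 1
  fi
}
```

For example, `pick_form_factor 450` prints `medium`, since 450 exceeds the 300-migration default limit but fits within the 600-migration medium limit.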

Scenarios of upscale configuration

Case 1: HCX Manager upgraded from 4.7-4.9 to 4.10 or newer but "Section 2) Legacy Procedure - HCX version 4.7 - 4.9" was not applied previously

  • Scale form factor is not set
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 2: HCX Manager upgraded from 4.7-4.9 to 4.10 or newer with "Section 2) Legacy Procedure - HCX version 4.7 - 4.9" applied previously

  • Scale form factor is reset
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 3: HCX Manager is newly deployed with default scale form factor

  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 4: HCX Manager is upgraded from 4.10.0.0 to 4.10+ with scale form factor not applied

  • User must increase VM compute and storage to medium or large scale form factor in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 5: HCX Manager is upgraded from 4.10.0.0 to 4.10+ with scale form factor applied

  • HCX Manager will retain the already applied settings after the upgrade, unless improved configurations are introduced for the predefined scale form factors in the newer release
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Steps to perform upscale configuration of HCX Managers

A) Ensure each HCX Manager has the appropriate CPU/Memory/Disk space for the required scale form factor

Scale Form Factor   vCPU Count   Memory (GB)   Storage (GB)   Concurrent Bulk/RAV Migrations per HCX Manager
Default             4            12            64             300
Medium              8            24            120            600
Large               16           48            300            1000

 

Procedure to increase resources on the HCX Connector/Cloud Manager

The following procedure must be used to increase resource allocation on both the HCX Connector and HCX Cloud Manager VMs.

Requirements and Considerations before increasing resources on the HCX Connector & Cloud Manager

  • Do NOT exceed recommended allocations as that may cause the HCX Connector/Cloud Manager to malfunction.
  • Both HCX Cloud Manager and Connector must be running version HCX 4.10.0 or newer
  • There should be NO active migration or configuration workflows when making these resource changes.
  • Changes must be made during a scheduled Maintenance Window.
  • There is NO impact to Network Extension services.
  • There is NO change of concurrency for HCX vMotion/Cold Migration workflow.
  • The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the Initial & Delta sync. During the RAV switchover stage, only one relocation is serviced at a time on a serial basis.
  • Additional Service Meshes/IX appliances should be deployed for unique workload clusters to aggregate the replication capacity of multiple IX appliances. A different Service Mesh can be deployed for each workload cluster at source and/or target.
  • If there are multiple Service Meshes/IX Appliances then RAV can switch over in parallel; however, per SM/IX pair it will always be sequential.

Procedure

IMPORTANT: 
It is recommended to take snapshots for HCX Connector & Cloud Manager VMs prior to executing steps.

Step 1: Increase the vCPU and memory of HCX Manager to match the desired scale factor in the above table

Step 2: Add a 120GB or 300GB Storage disk to HCX Connector & Cloud Manager based on the desired scale factor in the above table

IMPORTANT: The following steps can be used to add a 120GB or 300GB disk to both HCX Managers. Refer to Broadcom Knowledge Article 316591 for adding a new virtual disk to an existing Linux virtual machine.

  • Mount the created disk to the HCX Managers.
mkdir -p /common_ext
mount /dev/sdc1 /common_ext
df -hT
# Check that /common_ext has been mounted and has the correct type
  • Add an entry to "/etc/fstab" to ensure mounted disk will sustain a reboot and HCX Manager upgrade.
vi /etc/fstab   
/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2

Note: Use the Linux vi editor to edit/modify the file.

1. Press the "i" key for insert mode and make the change.
2. Press the ESC key to return to normal mode.
3. Type ":wq!" to save the updated file and exit the editor.
4. Type ":q!" to exit the editor without saving the file.
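If the procedure may be re-run, the fstab edit can be made idempotent so the entry is never duplicated. A minimal sketch, assuming the /dev/sdc1 device name from the step above; the path parameter exists only to ease testing:

```shell
#!/bin/sh
# Append the /common_ext mount to fstab only if it is not already present.
add_common_ext_fstab() {
  fstab="${1:-/etc/fstab}"   # path parameter for testing; defaults to /etc/fstab
  entry='/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2'
  grep -qF '/common_ext' "$fstab" || echo "$entry" >> "$fstab"
}
```

Running it twice leaves a single entry; `mount -a` can then be used to validate the line before the next reboot.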

Step 3: Stop HCX services as below:
# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management

Step 4: Redirect existing contents under "kafka-db" and "postgres-db" to the newly created disk.

  • Move directory "/common/kafka-db" to "/common/kafka-db.bak".
cd  /common
mv kafka-db kafka-db.bak
  • Create a new directory "/common_ext/kafka-db".
cd  /common_ext
mkdir kafka-db

Note: The Kafka contents do not need to be copied; they will be regenerated after the kafka/app-engine services restart.

  • Change the ownership and permissions of this directory to match "/common/kafka-db.bak".
chmod 755 kafka-db
chown kafka:kafka kafka-db
  • Make a soft link from "/common/kafka-db" to "/common_ext/kafka-db".
cd  /common
ln -s /common_ext/kafka-db kafka-db
  • Move directory "/common/postgres-db" to "/common/postgres-db.bak" as a backup
cd  /common
mv postgres-db postgres-db.bak
  • Copy the content for directory "/common/postgres-db.bak" to "/common_ext/postgres-db" and change the ownership to postgres.

Note: Use "-R" option to change the ownership for "/common_ext/postgres-db" as below:

cp -r /common/postgres-db.bak /common_ext/postgres-db
chown -R postgres:postgres /common_ext/postgres-db
  • Make a soft link from "/common/postgres-db" to "/common_ext/postgres-db".
cd  /common
ln -s /common_ext/postgres-db postgres-db
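Step 4 applies the same pattern to both directories: move the original aside, recreate (or copy) it on the extension disk, and symlink it back. The steps above can be sketched as one generic function; `relocate_dir` is a hypothetical helper, and the chown is commented out because it requires root and the kafka/postgres users:

```shell
#!/bin/sh
# Sketch of the Step 4 pattern: relocate a directory from /common to the
# extension disk and leave a symlink behind. copy=yes preserves contents
# (postgres-db); copy=no starts empty (kafka-db is regenerated on restart).
relocate_dir() {
  src_root="$1"; ext_root="$2"; name="$3"; copy="$4"
  mv "$src_root/$name" "$src_root/$name.bak"
  if [ "$copy" = "yes" ]; then
    cp -r "$src_root/$name.bak" "$ext_root/$name"
  else
    mkdir -p "$ext_root/$name"
    chmod 755 "$ext_root/$name"
  fi
  # chown -R kafka:kafka or postgres:postgres "$ext_root/$name"  (requires root)
  ln -s "$ext_root/$name" "$src_root/$name"
}

# Usage matching the article's steps:
#   relocate_dir /common /common_ext kafka-db no
#   relocate_dir /common /common_ext postgres-db yes
```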

Step 5: Start HCX services as below:

# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management

 

B) Execute upscale_configs.sh script to medium or large scale form factor

  1. Login to HCX Manager SSH Console using 'admin' user.
  2. Switch to 'root' user
  3. Change directory to '/usr/local/hcx/sbin'
  4. Execute upscale_configs.sh using below command (the app-engine software process will be automatically restarted)

    sh upscale_configs.sh medium
    OR
    sh upscale_configs.sh large
  5. Wait until app-engine restarts completely before attempting to access the HCX UI to perform operations
    systemctl status app-engine
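Rather than re-running `systemctl status` by hand, a small wait loop can confirm app-engine is back before opening the HCX UI. A sketch only; adjust the timeout to your environment:

```shell
#!/bin/sh
# Poll a systemd unit until it reports "active", up to a timeout in seconds.
wait_for_service() {
  unit="$1"; timeout="${2:-120}"
  while [ "$timeout" -gt 0 ]; do
    [ "$(systemctl is-active "$unit" 2>/dev/null)" = "active" ] && return 0
    timeout=$((timeout - 1))
    sleep 1
  done
  echo "$unit did not become active in time" >&2
  return 1
}

# wait_for_service app-engine
```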

C) Add more disk space to '/common' partition in HCX Manager (HCX 4.10 or newer)

  1. Ensure section A) was followed where a new disk was added to the HCX Manager Virtual Machine of required size from vCenter Server
  2. Login to HCX Manager SSH Console using 'admin' user
  3. Switch to 'root' user
  4. Verify current partition with '/common' mount path

    # df -h
     
    Filesystem             Size  Used Avail Use% Mounted on
    /dev/root              7.6G  4.0G  3.3G  55% /
    devtmpfs               5.9G     0  5.9G   0% /dev
    tmpfs                  5.9G   64K  5.9G   1% /dev/shm
    tmpfs                  2.4G  644K  2.4G   1% /run
    tmpfs                  4.0M     0  4.0M   0% /sys/fs/cgroup
    /dev/sda2               10M  2.0M  8.1M  20% /boot/efi
    /dev/sda4              7.6G   92K  7.3G   1% /slot2
    /dev/mapper/vg01-lv01   44G  3.0G   39G   8% /common
    tmpfs                  1.2G     0  1.2G   0% /run/user/1000
  5. To rescan disks on the HCX Manager, execute the following command

    for host in /sys/class/scsi_host/*; do echo "- - -" | sudo tee $host/scan; ls /dev/sd* ; done
     
    # Response
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6  /dev/sdb
  6. To create a partition and add it to the existing '/common' volume using LVM, execute the following commands

    # pvcreate /dev/sdb
     
    Physical volume "/dev/sdb" successfully created.
     
    # vgextend vg01 /dev/sdb
     
    Volume group "vg01" successfully extended
     
    # lvm lvextend -l +100%FREE /dev/vg01/lv01
     
    Size of logical volume vg01/lv01 changed from 44.39 GiB (11364 extents) to <164.39 GiB (42083 extents).
    Logical volume vg01/lv01 successfully resized.
     
    # resize2fs -p /dev/mapper/vg01-lv01
     
    resize2fs 1.46.5 (30-Dec-2021)
    Filesystem at /dev/mapper/vg01-lv01 is mounted on /common; on-line resizing required
    old_desc_blocks = 3, new_desc_blocks = 11
    The filesystem on /dev/mapper/vg01-lv01 is now 43092992 (4k) blocks long.
  7. Verify that the partition has been extended

    # df -h
     
    Filesystem             Size  Used Avail Use% Mounted on
    /dev/root              7.6G  4.0G  3.3G  55% /
    devtmpfs               5.9G     0  5.9G   0% /dev
    tmpfs                  5.9G   64K  5.9G   1% /dev/shm
    tmpfs                  2.4G  652K  2.4G   1% /run
    tmpfs                  4.0M     0  4.0M   0% /sys/fs/cgroup
    /dev/sda2               10M  2.0M  8.1M  20% /boot/efi
    /dev/sda4              7.6G   92K  7.3G   1% /slot2
    /dev/mapper/vg01-lv01  162G  3.1G  152G   2% /common
    tmpfs                  1.2G     0  1.2G   0% /run/user/1000
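After extending, it is worth confirming that /common actually meets the storage requirement of the chosen form factor (64/120/300 GB). A sketch using df; the GNU coreutils `--output` option and the `has_capacity` helper name are assumptions, not part of HCX:

```shell
#!/bin/sh
# Check that a mount point has at least the required size in GB.
# Form-factor storage requirements: default 64, medium 120, large 300.
has_capacity() {
  mnt="$1"; need_gb="$2"
  have_kb=$(df -k --output=size "$mnt" | tail -n 1 | tr -d ' ')
  [ $((have_kb / 1024 / 1024)) -ge "$need_gb" ]
}

# has_capacity /common 300 || echo "/common is too small for the large form factor"
```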

 

 

Section 2) Legacy Procedure - HCX version 4.7 - 4.9 

 
Scale up Migration Concurrency

To improve concurrent migration scalability, resources on the HCX Connector & Cloud Manager must be increased as below:

Baseline Migration Concurrency:
Supported 300 Migrations (Bulk & RAV) per HCX Manager
 
vCPU   RAM (GB)   Disk Size (GB)   Tuning
4      12         64               N/A

Extended Migration Concurrency:
Supported 600 Migrations (Bulk & RAV) per HCX Manager
 
vCPU   RAM (GB)   Disk Size (GB)   Tuning
32     48         300              Y


Increase resources on the HCX Connector/Cloud Manager

The following procedure must be used to increase resource allocation on both the HCX Connector and HCX Cloud Manager VMs.

Requirements and Considerations before increasing resources on the HCX Connector & Cloud Manager
  • Do NOT exceed recommended allocations as that may cause the HCX Connector/Cloud Manager to malfunction.
  • Both HCX Cloud Manager and Connector must be running version HCX 4.7.0 or later.
  • There should be NO active migration or configuration workflows when making these resource changes.
  • Changes must be made during a scheduled Maintenance Window.
  • There is NO impact to Network Extension services.
  • There is NO change of concurrency for HCX vMotion/Cold Migration workflow.
  • The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the Initial & Delta sync. During the RAV switchover stage, only one relocation is serviced at a time on a serial basis.
  • Additional Service Meshes/IX appliances should be deployed for unique workload clusters to aggregate the replication capacity of multiple IX appliances. A different Service Mesh can be deployed for each workload cluster at source and/or target.
  • If there are multiple Service Meshes/IX Appliances then RAV can switch over in parallel; however, per SM/IX pair it will always be sequential.
Procedure

IMPORTANT:
It is recommended to take snapshots for HCX Connector & Cloud Manager VMs prior to executing steps.

Step 1: Increase the vCPU and memory of HCX Manager to 32 and 48GB respectively.

Step 2: Add a 300GB disk to HCX Connector & Cloud Manager.

IMPORTANT: The following steps can be used to add a 300GB disk to both HCX Managers. Refer to Broadcom Knowledge Article 316591 for adding a new virtual disk to an existing Linux virtual machine.
  • Mount the created disk to the HCX Managers.
mkdir -p /common_ext
mount /dev/sdc1 /common_ext
df -hT
# Check that /common_ext has been mounted and has the correct type
  • Add an entry to "/etc/fstab" to ensure mounted disk will sustain a reboot and HCX Manager upgrade.
vi /etc/fstab   
/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2
Note: Use the Linux vi editor to edit/modify the file.
1. Press the "i" key for insert mode and make the change.
2. Press the ESC key to return to normal mode.
3. Type ":wq!" to save the updated file and exit the editor.
4. Type ":q!" to exit the editor without saving the file.

Step 3: Stop HCX services as below:
# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
Step 4: Redirect existing contents under "kafka-db" and "postgres-db" to the newly created disk.
  • Move directory "/common/kafka-db" to "/common/kafka-db.bak".
cd  /common
mv kafka-db kafka-db.bak
  • Create a new directory "/common_ext/kafka-db".
cd  /common_ext
mkdir kafka-db
Note: The Kafka contents do not need to be copied; they will be regenerated after the kafka/app-engine services restart.
  • Change the ownership and permissions of this directory to match "/common/kafka-db.bak".
chmod 755 kafka-db
chown kafka:kafka kafka-db
  • Make a soft link from "/common/kafka-db" to "/common_ext/kafka-db".
cd  /common
ln -s /common_ext/kafka-db kafka-db
  • Move directory "/common/postgres-db" to "/common/postgres-db.bak" as a backup
cd  /common
mv postgres-db postgres-db.bak
  • Copy the content for directory "/common/postgres-db.bak" to "/common_ext/postgres-db" and change the ownership to postgres.
Note: Use "-R" option to change the ownership for "/common_ext/postgres-db" as below:
cp -r /common/postgres-db.bak /common_ext/postgres-db
chown -R postgres:postgres /common_ext/postgres-db
  • Make a soft link from "/common/postgres-db" to "/common_ext/postgres-db".
cd  /common
ln -s /common_ext/postgres-db postgres-db
Step 5: Start HCX services as below:
# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management

Performance Tuning on the HCX Manager

In addition to increasing HCX resources, you must perform the following tuning steps to scale concurrent migrations.
IMPORTANT: The steps performed in this procedure are not persisted after an HCX Manager upgrade.

Procedure

Step 6: Stop HCX services again.
Login to HCX Connector/Cloud Manager Root Console

# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
Step 7: Increase the memory allocation in the app-engine framework.
  • Edit "app-engine-start" file to increase JAVA memory allocation and max perm size.
vi /etc/systemd/app-engine-start  
JAVA_OPTS="-Xmx4096m -Xms4096m -XX:MaxPermSize=1024m ...
Step 8: Increase thread pooling for Mobility Migration services.
  • Edit "MobilityMigrationService.zql" and "MobilityTransferService.zql" to increase thread numbers.
vi /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql 
"numberOfThreads": "50",   

vi /opt/vmware/deploy/zookeeper/MobilityTransferService.zql  
"numberOfThreads":50,
Step 9: Increase message size limit for kafka framework.
  • Edit "vchsApplication.zql" and update "kafkaMaxMessageSizeBytes" from "2097152" to "4194304".
vi /opt/vmware/deploy/zookeeper/vchsApplication.zql 
"kafkaMaxMessageSizeBytes":4194304
  • Edit "kafka server.properties" and update "message.max.bytes" from "2097152" to "4194304".
vi /etc/kafka/server.properties  
message.max.bytes=4194304
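Steps 7-9 are plain-text edits and can be scripted with sed if preferred. A hedged sketch of the Step 9 changes only; verify the file paths and keys exist on your version before running, and back up both files first (the file parameters are for testing, the defaults are the paths the article names):

```shell
#!/bin/sh
# Sketch: apply the Step 9 kafka message-size tuning with sed instead of vi.
bump_kafka_msg_size() {
  zql="${1:-/opt/vmware/deploy/zookeeper/vchsApplication.zql}"
  props="${2:-/etc/kafka/server.properties}"
  # Update kafkaMaxMessageSizeBytes from 2097152 to 4194304
  sed -i 's/"kafkaMaxMessageSizeBytes":2097152/"kafkaMaxMessageSizeBytes":4194304/' "$zql"
  # Update message.max.bytes to 4194304
  sed -i 's/^message\.max\.bytes=.*/message.max.bytes=4194304/' "$props"
}
```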
Step 10: Start HCX services.
# systemctl start postgresdb
# systemctl start zookeeper 
# systemctl start kafka 
# systemctl start app-engine 
# systemctl start web-engine 
# systemctl start appliance-management
Step 11: Check that the below services are running in the HCX Connector/Cloud Manager:
admin@hcx [ ~ ]$ systemctl --type=service | grep "zoo\|kaf\|web\|app\|postgres"
  app-engine.service                   loaded active     running       App-Engine                                                        
  appliance-management.service         loaded active     running       Appliance Management                                              
  kafka.service                        loaded active     running       Kafka                                                             
  postgresdb.service                   loaded active     running       PostgresDB                                                                                              
  web-engine.service                   loaded active     running       WebEngine                                                         
  zookeeper.service                    loaded active     running       Zookeeper 
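The Step 11 check can also be scripted so it fails fast on the first stopped service. A sketch; `check_hcx_services` is a hypothetical helper, not an HCX command:

```shell
#!/bin/sh
# Verify all HCX Manager services are active; report the first failure.
check_hcx_services() {
  for s in postgresdb zookeeper kafka app-engine web-engine appliance-management; do
    if ! systemctl is-active --quiet "$s"; then
      echo "service $s is not running" >&2
      return 1
    fi
  done
  echo "all HCX services running"
}
```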
IMPORTANT: In the event the HCX Manager fails to reboot OR any of the above listed services fail to start, revert the configuration changes immediately and ensure the system comes back on-line. Additionally, snapshots can be used to revert the above configurations in case of any failure while applying the steps.
Note: The snapshot revert process won't restore the HCX Connector/Cloud Manager's compute resources (vCPU/MEM). The user must follow "Step 1" to restore the vCPU and memory of the HCX Manager to the baseline values of 4 vCPU and 12GB respectively, if needed.

 

Section 3) Recommendations for operating concurrent migrations at scale (all HCX versions 4.7 or newer)

  • As a best practice, use vSphere Monitoring and Performance to monitor HCX Connector & Cloud Manager CPU utilization and MEM usage.
  • Do NOT exceed the recommended limits as that could cause system instability and failed migration workflows.
  • In a scaled-up environment, when migration operations are being processed, expect CPU utilization to increase significantly for short periods of time, and there may be a temporary delay in UI responses for migration progress events.
  • Limit the concurrency of MON operations on the target cloud when making configuration changes while concurrent Bulk migrations into MON-enabled segments are actively switching over.
  • Follow the migration events and estimation on the HCX UI to determine any slowness that may be caused by the infrastructure or the network.
  • Additionally, vSphere Replication status can be monitored from the source ESXi host. Refer to Broadcom Knowledge Article 323663.
  • If a source ESXi host is heavily loaded from a memory or I/O perspective, replication performance will be affected. As a result, the Bulk/RAV workflow may take more time to complete the initial base sync even when there is no slowness in the underlying datapath.
Note: In such cases, the recommendation is to relocate the source VM's compute resources to another, less loaded ESXi host using native vCenter vMotion. This action won't impact the ongoing replication process and does not require any changes to the migration workflow.
  • The Bulk/RAV migration workflow consists of multiple stages (i.e. initial/delta sync, off-line sync, disk consolidation, data checksum, VM instantiation, etc.), most of which do not depend on the network infrastructure. The time to complete a migration for any given VM, from start to finish, may therefore vary with conditions; it is not a simple calculation based on the size of the VM and the assumed network bandwidth.

Additional Information

Refer to HCX Configuration Limits
Refer to Network Underlay Characterization for more information about HCX dependencies on the network infrastructure between sites.
Refer to HCX Bulk Migration Operations & Best Practices
Contact your Cloud Provider regarding the availability of this procedure to scale up your cloud Data Center.
For scale-up requirements on VMConAWS Cloud, please open a service request with the Broadcom Support team.