HCX - Bulk Migration & Replication Assisted vMotion (RAV) scalability guide

Article ID: 321604


Products

VMware HCX

Issue/Introduction

This document describes the functional capacity for migrations using vSphere Replication (vSR) Bulk and Replication Assisted vMotion (RAV) in HCX.
The supported scale numbers are referenced per HCX Manager, irrespective of the number of Site Pairings or Service Mesh/IX Appliances deployed.
A Configuration Guide is provided within this document to increase the scale of concurrent Bulk/RAV migrations per HCX Manager beyond the default value if desired.

Considerations for concurrent Migration

There are several factors, at both the source and target HCX Managers, that can limit the number of concurrent migrations performed using Bulk & RAV (initial/delta sync):
  • Data storage
    • IOPS capacity
    • Shared vs. dedicated
  • Host resources
    • Overall ESXi host resources for all services
    • CPU & MEM reservations for the IX appliance VM
    • pNIC/VMNIC capacity and shared load
    • Dedicated vmk interfaces for different services like mgmt/vMotion/vSR.
  • Network Infrastructure throughout the entire data path
    • Data Center local network
    • Service Provider network infrastructure between source/target sites
    • Bandwidth availability
    • Latency and path reliability (packet loss)
      • vSphere replication (vSR) performance drops exponentially with higher packet loss and/or higher latency.
      • There is a built-in tolerance for high latency in vSphere replication but throughput will be reduced significantly.
Note: The HCX Transport Analytics functionality can be used to measure network infrastructure throughput during the migration planning phase. Refer to the Broadcom HCX User Guide.
  • Workload VM conditions
    • Number of disks
    • Total size and size per disk
    • Active services/applications
    • Data churning/disk changes

Default (Baseline) HCX Manager Resource Allocation:
 
vCPU   RAM (GB)   Disk Size (GB)
4      12         64

The supported numbers for concurrent Bulk/RAV migrations per Baseline HCX Manager deployments are:
  • 300 concurrent migrations per Manager
  • 200 concurrent migrations per Service Mesh/IX Appliance.
  • 1Gbps max per migration workflow
  • 1.6Gbps max per IX appliance (any number of concurrent migration workflows)

Resolution

 
The following Configuration Guide to increase the scale of concurrent Bulk/RAV migrations per HCX Manager is split into three sections depending on the HCX software version installed:
 
Section 1) New Procedure - HCX version 4.10 or newer
Section 2) Legacy Procedure - HCX version 4.7 - 4.9
Section 3) Recommendations for operating concurrent migrations at scale (all HCX versions 4.7 or newer)
 
It is highly recommended to use the new procedure introduced with HCX version 4.10 or newer. The benefits of the new procedure are:
  • Allows up to 1000 concurrent Bulk/RAV Migrations per HCX Manager. This is an increase from the 600 scale-up value in HCX versions 4.7 - 4.9 and the 300 default value in HCX versions prior to 4.7
  • Configurable scale settings on HCX Managers (default, medium, large)
  • Scale related configuration changes are persisted after an HCX Manager upgrade
  • A script is available to automate the required configuration change on the HCX Manager
  • Increased /common disk partition space on the HCX Manager

The configuration steps are executed on each HCX Manager, and the supported scale numbers are referenced per HCX Manager, irrespective of the number of Site Pairings or Service Mesh/IX Appliances deployed.

Section 1) New Procedure - HCX version 4.10 or newer

HCX 4.10 or newer introduces automation via an executable script that allows a scaled concurrent Bulk/RAV migration size setting of default, medium, or large on each HCX Manager. These form factors include pre-defined settings for disk space, app-engine memory allocation, and the number of threads for the different migration scales executed on each HCX Manager.

 

Scale Form Factor   vCPU Count   Memory (GB)   Storage (GB)   Concurrent Bulk/RAV Migrations per HCX Manager
Default             4            12            64             300
Medium              8            24            120            600
Large               16           48            300            1000
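As a rough planning aid, the sizing table above can be expressed as a small helper that picks the smallest form factor supporting a desired concurrency. This is a hypothetical sketch; `pick_form_factor` is not part of HCX:

```shell
#!/bin/sh
# Hypothetical helper: map a desired number of concurrent Bulk/RAV
# migrations to the smallest HCX Manager scale form factor that supports it.
pick_form_factor() {
  n="$1"
  if [ "$n" -le 300 ]; then
    echo "default"    # 4 vCPU / 12 GB / 64 GB
  elif [ "$n" -le 600 ]; then
    echo "medium"     # 8 vCPU / 24 GB / 120 GB
  elif [ "$n" -le 1000 ]; then
    echo "large"      # 16 vCPU / 48 GB / 300 GB
  else
    echo "unsupported: maximum is 1000 per HCX Manager" >&2
    return 1
  fi
}
```

For example, `pick_form_factor 450` prints `medium`, since 450 exceeds the 300-migration default limit but fits within the 600-migration medium limit.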

Scenarios of upscale configuration

Case 1: HCX Manager upgraded from 4.7-4.9 to 4.10 or newer but "Section 2) Legacy Procedure - HCX version 4.7 - 4.9" was not applied previously

  • Scale form factor is not set
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 2: HCX Manager upgraded from 4.7-4.9 to 4.10 or newer with "Section 2) Legacy Procedure - HCX version 4.7 - 4.9" applied previously

  • Scale form factor is reset
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 3: HCX Manager is newly deployed with default scale form factor

  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 4: HCX Manager is upgraded from 4.10.0.0 to 4.10+ with scale form factor not applied

  • User must increase VM compute and storage to medium or large scale form factor in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Case 5: HCX Manager is upgraded from 4.10.0.0 to 4.10+ with scale form factor applied

  • HCX Manager will retain the already applied settings after the upgrade, unless improved configurations are introduced for the predefined scale form factors in the newer release
  • User must increase VM compute and storage to medium or large scale form factor as in section A) below
  • User can increase /common disk partition space as mentioned in section C) below
  • User can execute upscale_configs.sh script to medium or large scale form factor as in section B) below

Steps to perform upscale configuration of HCX Managers

A) Ensure each HCX Manager has the appropriate CPU/Memory/Disk space for the required scale form factor

Scale Form Factor   vCPU Count   Memory (GB)   Storage (GB)   Concurrent Bulk/RAV Migrations per HCX Manager
Default             4            12            64             300
Medium              8            24            120            600
Large               16           48            300            1000

 

Procedure to increase resources on the HCX Connector/Cloud Manager

The following procedure must be used to increase resource allocation on both the HCX Connector and HCX Cloud Manager VMs.

Requirements and Considerations before increasing resources on the HCX Connector & Cloud Manager

  • Do NOT exceed recommended allocations as that may cause the HCX Connector/Cloud Manager to malfunction.
  • Both HCX Cloud Manager and Connector must be running version HCX 4.10.0 or newer
  • There should be NO active migration or configuration workflows when making these resource changes.
  • Changes must be made during a scheduled Maintenance Window.
  • There is NO impact to Network Extension services.
  • There is NO change of concurrency for HCX vMotion/Cold Migration workflow.
  • The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the Initial & Delta sync. During the RAV switchover stage, only one relocation is serviced at a time on a serial basis.
  • Additional Service Meshes/IX appliances should be deployed for unique workload clusters to aggregate the replication capacity of multiple IX appliances. A different Service Mesh can be deployed for each workload cluster at source and/or target.
  • If there are multiple Service Meshes/IX Appliances then RAV can switch over in parallel; however, per SM/IX pair it will always be sequential.

Procedure

IMPORTANT: 
It is recommended to take snapshots for HCX Connector & Cloud Manager VMs prior to executing steps.

Step 1: Increase the vCPU and memory of HCX Manager to match the desired scale factor in the above table

Step 2: Add a 120GB or 300GB Storage disk to HCX Connector & Cloud Manager based on the desired scale factor in the above table

IMPORTANT: The following steps can be used to add a 120GB or 300GB disk to both HCX Managers. Refer to Broadcom Knowledge Article 316591 for adding a new virtual disk to an existing Linux virtual machine.

  • Mount the created disk to the HCX Managers.
mkdir -p /common_ext
mount /dev/sdc1 /common_ext
df -hT
# Check that /common_ext has been mounted and has the correct type
  • Add an entry to "/etc/fstab" to ensure mounted disk will sustain a reboot and HCX Manager upgrade.
vi /etc/fstab   
/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2

Note: Use the Linux vi editor to edit/modify the file.

1. Press the "i" key for insert mode and make the change.
2. Press the ESC key to return to normal mode.
3. Type ":wq!" to save the updated file and exit the editor.
4. Type ":q!" to exit the editor without saving the file.
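If the procedure may be re-run, the fstab edit can be made idempotent so the entry is never duplicated. A minimal sketch, assuming the /dev/sdc1 device name from the step above; the path parameter exists only to ease testing:

```shell
#!/bin/sh
# Append the /common_ext mount to fstab only if it is not already present.
add_common_ext_fstab() {
  fstab="${1:-/etc/fstab}"   # path parameter for testing; defaults to /etc/fstab
  entry='/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2'
  grep -qF '/common_ext' "$fstab" || echo "$entry" >> "$fstab"
}
```

Running it twice leaves a single entry; `mount -a` can then be used to validate the line before the next reboot.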

Step 3: Stop HCX services as below:
# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management

Step 4: Redirect existing contents under "kafka-db" and "postgres-db" to the newly created disk.

  • Move directory "/common/kafka-db" to "/common/kafka-db.bak".
cd  /common
mv kafka-db kafka-db.bak
  • Create a new directory "/common_ext/kafka-db".
cd  /common_ext
mkdir kafka-db

Note: The Kafka contents do not need to be copied; they will be regenerated after the kafka/app-engine services restart.

  • Change the ownership and permissions of this directory to match "/common/kafka-db.bak".
chmod 755 kafka-db
chown kafka:kafka kafka-db
  • Make a soft link from "/common/kafka-db" to "/common_ext/kafka-db".
cd  /common
ln -s /common_ext/kafka-db kafka-db
  • Move directory "/common/postgres-db" to "/common/postgres-db.bak" as a backup
cd  /common
mv postgres-db postgres-db.bak
  • Copy the content for directory "/common/postgres-db.bak" to "/common_ext/postgres-db" and change the ownership to postgres.

Note: Use "-R" option to change the ownership for "/common_ext/postgres-db" as below:

cp -r /common/postgres-db.bak /common_ext/postgres-db
chown -R postgres:postgres /common_ext/postgres-db
  • Make a soft link from "/common/postgres-db" to "/common_ext/postgres-db".
cd  /common
ln -s /common_ext/postgres-db postgres-db
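Step 4 applies the same pattern to both directories: move the original aside, recreate (or copy) it on the extension disk, and symlink it back. The steps above can be sketched as one generic function; `relocate_dir` is a hypothetical helper, and the chown is commented out because it requires root and the kafka/postgres users:

```shell
#!/bin/sh
# Sketch of the Step 4 pattern: relocate a directory from /common to the
# extension disk and leave a symlink behind. copy=yes preserves contents
# (postgres-db); copy=no starts empty (kafka-db is regenerated on restart).
relocate_dir() {
  src_root="$1"; ext_root="$2"; name="$3"; copy="$4"
  mv "$src_root/$name" "$src_root/$name.bak"
  if [ "$copy" = "yes" ]; then
    cp -r "$src_root/$name.bak" "$ext_root/$name"
  else
    mkdir -p "$ext_root/$name"
    chmod 755 "$ext_root/$name"
  fi
  # chown -R kafka:kafka or postgres:postgres "$ext_root/$name"  (requires root)
  ln -s "$ext_root/$name" "$src_root/$name"
}

# Usage matching the article's steps:
#   relocate_dir /common /common_ext kafka-db no
#   relocate_dir /common /common_ext postgres-db yes
```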

Step 5: Start HCX services as below:

# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management

 

B) Execute upscale_configs.sh script to medium or large scale form factor

  1. Login to HCX Manager SSH Console using 'admin' user.
  2. Switch to 'root' user
  3. Change directory to '/usr/local/hcx/sbin'
  4. Execute upscale_configs.sh using below command (the app-engine software process will be automatically restarted)

    sh upscale_configs.sh medium
    OR
    sh upscale_configs.sh large
  5. Wait until app-engine restarts completely before attempting to access the HCX UI to perform operations
    systemctl status app-engine
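Rather than re-running `systemctl status` by hand, a small wait loop can confirm app-engine is back before opening the HCX UI. A sketch only; adjust the timeout to your environment:

```shell
#!/bin/sh
# Poll a systemd unit until it reports "active", up to a timeout in seconds.
wait_for_service() {
  unit="$1"; timeout="${2:-120}"
  while [ "$timeout" -gt 0 ]; do
    [ "$(systemctl is-active "$unit" 2>/dev/null)" = "active" ] && return 0
    timeout=$((timeout - 1))
    sleep 1
  done
  echo "$unit did not become active in time" >&2
  return 1
}

# wait_for_service app-engine
```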

C) Add more disk space to '/common' partition in HCX Manager (HCX 4.10 or newer)

  1. Ensure section A) was followed where a new disk was added to the HCX Manager Virtual Machine of required size from vCenter Server
  2. Login to HCX Manager SSH Console using 'admin' user
  3. Switch to 'root' user
  4. Verify current partition with '/common' mount path

    # df -h
     
    Filesystem             Size  Used Avail Use% Mounted on
    /dev/root              7.6G  4.0G  3.3G  55% /
    devtmpfs               5.9G     0  5.9G   0% /dev
    tmpfs                  5.9G   64K  5.9G   1% /dev/shm
    tmpfs                  2.4G  644K  2.4G   1% /run
    tmpfs                  4.0M     0  4.0M   0% /sys/fs/cgroup
    /dev/sda2               10M  2.0M  8.1M  20% /boot/efi
    /dev/sda4              7.6G   92K  7.3G   1% /slot2
    /dev/mapper/vg01-lv01   44G  3.0G   39G   8% /common
    tmpfs                  1.2G     0  1.2G   0% /run/user/1000
  5. To rescan disks on the HCX Manager, execute the following command

    for host in /sys/class/scsi_host/*; do echo "- - -" | sudo tee $host/scan; ls /dev/sd* ; done
     
    # Response
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6
    - - -
    /dev/sda  /dev/sda1  /dev/sda2  /dev/sda3  /dev/sda4  /dev/sda5  /dev/sda6  /dev/sdb
  6. To create a partition and add it to the existing '/common' volume using LVM, execute the following commands

    # pvcreate /dev/sdb
     
    Physical volume "/dev/sdb" successfully created.
     
    # vgextend vg01 /dev/sdb
     
    Volume group "vg01" successfully extended
     
    # lvm lvextend -l +100%FREE /dev/vg01/lv01
     
    Size of logical volume vg01/lv01 changed from 44.39 GiB (11364 extents) to <164.39 GiB (42083 extents).
    Logical volume vg01/lv01 successfully resized.
     
    # resize2fs -p /dev/mapper/vg01-lv01
     
    resize2fs 1.46.5 (30-Dec-2021)
    Filesystem at /dev/mapper/vg01-lv01 is mounted on /common; on-line resizing required
    old_desc_blocks = 3, new_desc_blocks = 11
    The filesystem on /dev/mapper/vg01-lv01 is now 43092992 (4k) blocks long.
  7. Verify that the partition has been extended

    # df -h
     
    Filesystem             Size  Used Avail Use% Mounted on
    /dev/root              7.6G  4.0G  3.3G  55% /
    devtmpfs               5.9G     0  5.9G   0% /dev
    tmpfs                  5.9G   64K  5.9G   1% /dev/shm
    tmpfs                  2.4G  652K  2.4G   1% /run
    tmpfs                  4.0M     0  4.0M   0% /sys/fs/cgroup
    /dev/sda2               10M  2.0M  8.1M  20% /boot/efi
    /dev/sda4              7.6G   92K  7.3G   1% /slot2
    /dev/mapper/vg01-lv01  162G  3.1G  152G   2% /common
    tmpfs                  1.2G     0  1.2G   0% /run/user/1000
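After extending, it is worth confirming that /common actually meets the storage requirement of the chosen form factor (64/120/300 GB). A sketch using df; the GNU coreutils `--output` option and the `has_capacity` helper name are assumptions, not part of HCX:

```shell
#!/bin/sh
# Check that a mount point has at least the required size in GB.
# Form-factor storage requirements: default 64, medium 120, large 300.
has_capacity() {
  mnt="$1"; need_gb="$2"
  have_kb=$(df -k --output=size "$mnt" | tail -n 1 | tr -d ' ')
  [ $((have_kb / 1024 / 1024)) -ge "$need_gb" ]
}

# has_capacity /common 300 || echo "/common is too small for the large form factor"
```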

 

 

Section 2) Legacy Procedure - HCX version 4.7 - 4.9 

 
Scale up Migration Concurrency

To improve concurrent migration scalability, resources on the HCX Connector & Cloud Manager must be increased as below:

Baseline Migration Concurrency:
Supported 300 Migrations (Bulk & RAV) per HCX Manager
 
vCPU   RAM (GB)   Disk Size (GB)   Tuning
4      12         64               N/A

Extended Migration Concurrency:
Supported 600 Migrations (Bulk & RAV) per HCX Manager
 
vCPU   RAM (GB)   Disk Size (GB)   Tuning
32     48         300              Y


Increase resources on the HCX Connector/Cloud Manager

The following procedure must be used to increase resource allocation on both the HCX Connector and HCX Cloud Manager VMs.

Requirements and Considerations before increasing resources on the HCX Connector & Cloud Manager
  • Do NOT exceed recommended allocations as that may cause the HCX Connector/Cloud Manager to malfunction.
  • Both HCX Cloud Manager and Connector must be running version HCX 4.7.0 or later.
  • There should be NO active migration or configuration workflows when making these resource changes.
  • Changes must be made during a scheduled Maintenance Window.
  • There is NO impact to Network Extension services.
  • There is NO change of concurrency for HCX vMotion/Cold Migration workflow.
  • The concurrent migration limit specified for HCX Replication Assisted vMotion (RAV) applies ONLY to the Initial & Delta sync. During the RAV switchover stage, only one relocation is serviced at a time on a serial basis.
  • Additional Service Meshes/IX appliances should be deployed for unique workload clusters to aggregate the replication capacity of multiple IX appliances. A different Service Mesh can be deployed for each workload cluster at source and/or target.
  • If there are multiple Service Meshes/IX Appliances then RAV can switch over in parallel; however, per SM/IX pair it will always be sequential.
Procedure

IMPORTANT:
It is recommended to take snapshots for HCX Connector & Cloud Manager VMs prior to executing steps.

Step 1: Increase the vCPU and memory of HCX Manager to 32 and 48GB respectively.

Step 2: Add a 300GB disk to HCX Connector & Cloud Manager.

IMPORTANT: The following steps can be used to add a 300GB disk to both HCX Managers. Refer to Broadcom Knowledge Article 316591 for adding a new virtual disk to an existing Linux virtual machine.
  • Mount the created disk to the HCX Managers.
mkdir -p /common_ext
mount /dev/sdc1 /common_ext
df -hT
# Check that /common_ext has been mounted and has the correct type
  • Add an entry to "/etc/fstab" to ensure mounted disk will sustain a reboot and HCX Manager upgrade.
vi /etc/fstab   
/dev/sdc1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2
Note: Use the Linux vi editor to edit/modify the file.
1. Press the "i" key for insert mode and make the change.
2. Press the ESC key to return to normal mode.
3. Type ":wq!" to save the updated file and exit the editor.
4. Type ":q!" to exit the editor without saving the file.

Step 3: Stop HCX services as below:
# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
Step 4: Redirect existing contents under "kafka-db" and "postgres-db" to the newly created disk.
  • Move directory "/common/kafka-db" to "/common/kafka-db.bak".
cd  /common
mv kafka-db kafka-db.bak
  • Create a new directory "/common_ext/kafka-db".
cd  /common_ext
mkdir kafka-db
Note: The Kafka contents do not need to be copied; they will be regenerated after the kafka/app-engine services restart.
  • Change the ownership and permissions of this directory to match "/common/kafka-db.bak".
chmod 755 kafka-db
chown kafka:kafka kafka-db
  • Make a soft link from "/common/kafka-db" to "/common_ext/kafka-db".
cd  /common
ln -s /common_ext/kafka-db kafka-db
  • Move directory "/common/postgres-db" to "/common/postgres-db.bak" as a backup
cd  /common
mv postgres-db postgres-db.bak
  • Copy the content for directory "/common/postgres-db.bak" to "/common_ext/postgres-db" and change the ownership to postgres.
Note: Use "-R" option to change the ownership for "/common_ext/postgres-db" as below:
cp -r /common/postgres-db.bak /common_ext/postgres-db
chown -R postgres:postgres /common_ext/postgres-db
  • Make a soft link from "/common/postgres-db" to "/common_ext/postgres-db".
cd  /common
ln -s /common_ext/postgres-db postgres-db
Step 5: Start HCX services as below:
# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management

Performance Tuning on the HCX Manager

In addition to increasing HCX resources, you must perform the following tuning steps to scale concurrent migrations.
IMPORTANT: The steps performed in this procedure are not persisted after an HCX Manager upgrade.

Procedure

Step 6: Stop HCX services again.
Login to HCX Connector/Cloud Manager Root Console

# systemctl stop postgresdb
# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
Step 7: Increase the memory allocation in the app-engine framework.
  • Edit "app-engine-start" file to increase JAVA memory allocation and max perm size.
vi /etc/systemd/app-engine-start  
JAVA_OPTS="-Xmx4096m -Xms4096m -XX:MaxPermSize=1024m ...
Step 8: Increase thread pooling for Mobility Migration services.
  • Edit "MobilityMigrationService.zql" and "MobilityTransferService.zql" to increase thread numbers.
vi /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql 
"numberOfThreads": "50",   

vi /opt/vmware/deploy/zookeeper/MobilityTransferService.zql  
"numberOfThreads":50,
Step 9: Increase message size limit for kafka framework.
  • Edit "vchsApplication.zql" and update "kafkaMaxMessageSizeBytes" from "2097152" to "4194304".
vi /opt/vmware/deploy/zookeeper/vchsApplication.zql 
"kafkaMaxMessageSizeBytes":4194304
  • Edit "kafka server.properties" and update "message.max.bytes" from "2097152" to "4194304".
vi /etc/kafka/server.properties  
message.max.bytes=4194304
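Steps 7-9 are plain-text edits and can be scripted with sed if preferred. A hedged sketch of the Step 9 changes only; verify the file paths and keys exist on your version before running, and back up both files first (the file parameters are for testing, the defaults are the paths the article names):

```shell
#!/bin/sh
# Sketch: apply the Step 9 kafka message-size tuning with sed instead of vi.
bump_kafka_msg_size() {
  zql="${1:-/opt/vmware/deploy/zookeeper/vchsApplication.zql}"
  props="${2:-/etc/kafka/server.properties}"
  # Update kafkaMaxMessageSizeBytes from 2097152 to 4194304
  sed -i 's/"kafkaMaxMessageSizeBytes":2097152/"kafkaMaxMessageSizeBytes":4194304/' "$zql"
  # Update message.max.bytes to 4194304
  sed -i 's/^message\.max\.bytes=.*/message.max.bytes=4194304/' "$props"
}
```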
Step 10: Start HCX services.
# systemctl start postgresdb
# systemctl start zookeeper 
# systemctl start kafka 
# systemctl start app-engine 
# systemctl start web-engine 
# systemctl start appliance-management
Step 11: Check that the below services are running in the HCX Connector/Cloud Manager:
admin@hcx [ ~ ]$ systemctl --type=service | grep "zoo\|kaf\|web\|app\|postgres"
  app-engine.service                   loaded active     running       App-Engine                                                        
  appliance-management.service         loaded active     running       Appliance Management                                              
  kafka.service                        loaded active     running       Kafka                                                             
  postgresdb.service                   loaded active     running       PostgresDB                                                                                              
  web-engine.service                   loaded active     running       WebEngine                                                         
  zookeeper.service                    loaded active     running       Zookeeper 
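The Step 11 check can also be scripted so it fails fast on the first stopped service. A sketch; `check_hcx_services` is a hypothetical helper, not an HCX command:

```shell
#!/bin/sh
# Verify all HCX Manager services are active; report the first failure.
check_hcx_services() {
  for s in postgresdb zookeeper kafka app-engine web-engine appliance-management; do
    if ! systemctl is-active --quiet "$s"; then
      echo "service $s is not running" >&2
      return 1
    fi
  done
  echo "all HCX services running"
}
```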
IMPORTANT: In the event the HCX Manager fails to reboot OR any of the above listed services fail to start, revert the configuration changes immediately and ensure the system comes back on-line. Additionally, snapshots can be used to revert the above configurations in case of any failure while applying the steps.
Note: The snapshot revert process won't restore the HCX Connector/Cloud Manager's compute resources (vCPU/MEM). The user must follow "Step 1" to restore the vCPU and memory of the HCX Manager to the baseline values of 4 vCPU and 12GB respectively, if needed.

 

Section 3) Recommendations for operating concurrent migrations at scale (all HCX versions 4.7 or newer)

  • As a best practice, use vSphere Monitoring and Performance to monitor HCX Connector & Cloud Manager CPU utilization and MEM usage.
  • Do NOT exceed the recommended limits as that could cause system instability and failed migration workflows.
  • In a scaled-up environment, when migration operations are being processed, expect CPU utilization to increase significantly for short periods of time, and there may be a temporary delay in UI responses for migration progress events.
  • Limit the concurrency of MON operations on the target cloud when making configuration changes while concurrent Bulk migrations into MON-enabled segments are actively switching over.
  • Follow the migration events and estimation on the HCX UI to determine any slowness that may be caused by the infrastructure or the network.
  • Additionally, vSphere Replication status can be monitored from the source ESXi host. Refer to Broadcom Knowledge Article 323663.
  • If a source ESXi host is heavily loaded from a memory or I/O perspective, replication performance will be affected. As a result, the Bulk/RAV workflow may take more time to complete the initial base sync even when there is no slowness in the underlying datapath.
Note: In such cases, the recommendation is to relocate the source VM's compute resources to another, less loaded ESXi host using native vCenter vMotion. This action won't impact the ongoing replication process and does not require any changes to the migration workflow.
  • The Bulk/RAV migration workflow consists of multiple stages (i.e. initial/delta sync, off-line sync, disk consolidation, data checksum, VM instantiation, etc.), most of which do not depend on the network infrastructure. The time to complete a migration for any given VM, from start to finish, may therefore vary with conditions; it is not a simple calculation based on the size of the VM and the assumed network bandwidth.

Additional Information

Refer to HCX Configuration Limits
Refer to Network Underlay Characterization for more information about HCX dependencies on the network infrastructure between sites.
Refer to HCX Bulk Migration Operations & Best Practices
Contact your Cloud Provider regarding the availability of this procedure to scale up your cloud Data Center.
For scale-up requirements on VMConAWS Cloud, please open a service request with the Broadcom Support team.