HCX Bulk Migration operations and best practices

Article ID: 323663

Updated On: 03-20-2025

Products

VMware HCX, VMware Cloud on AWS

Issue/Introduction

Introduction

The bulk migration capability of HCX uses vSphere Replication (vSR) to migrate the virtual machine's disk files while re-creating the VM on the destination HCX-enabled vSphere instance.

How it works

HCX Bulk Migration has the following stages:

  1. Placeholder disk creation: On the specified datastore at the target vCenter Server, placeholder disks are created so that the source VM's disk data can be replicated into them.
    Note: During the early stage of migration for a given VM, the empty (placeholder) disks are created by HCX on the target side, whereas the configuration (cfg) files are created by host-based replication (HBR). See the datastore listing sketch after this list for how these files may appear on the target side.

  2. Push LWD config: LWD configuration rules are pushed to the source and target IX appliances so that the HBR server (also known as the VR server) inside the target IX appliance can accept and process vSphere Replication traffic and forward it to the target ESX host over an NFC connection on port 902.
    Note: In the case of reverse bulk migration from Cloud to OnPrem, the HBR server is enabled on the OnPrem IX appliance.

  3. Enable replication: At the source side, HCX signals vCenter Server to enable replication. The replication status can be verified using the following commands in an ESXi root shell:
    vim-cmd vmsvc/getallvms
    vim-cmd hbrsvc/vmreplica.getState <VM ID>

  4. Start Full sync/Base sync: Once replication is enabled on the source ESX host (HBR), a full sync event on the source vCenter Server starts replicating data to the target via the HCX-IX appliance.

  5. RPO cycle/Delta Sync: After completion of the initial base sync, an RPO cycle of 2 hours is set to perform the delta sync.
    Note: Depending on the data churn on the source disk, additional snapshots may be created during the RPO cycle.
    Note: After each RPO cycle, disk consolidation takes place and creates an hbrdisk.RDID-*.vmdk file, known as the replica instance vmdk, on the target datastore.

  6. CONT Replication: If the migration switchover is planned for a scheduled maintenance window, the delta sync keeps running every 2 hours until the scheduled time is reached.
    Note: If you want to modify the switchover schedule, you can change it from the migration wizard at runtime.
    Note: You may force an immediate switchover by selecting the "Ignore failover window and start migration as soon as possible." option under the "Schedule failover window for migrations" tab, but the switchover still waits for any ongoing replication transfer to complete before transitioning to the cutover stage.

  7. Switchover: After completion of the initial or full sync, an image is created on the target side and switchover is triggered automatically unless a specific schedule is set, as described previously in this article.
    Note: The image consists of the VMX and NVRAM files, including the VMXF file if applicable.

    Switchover performs the following tasks in the backend:

    1. Power off source VM: To perform the offline sync, HCX signals the source vCenter Server to power off the VM, which stops further data churn on the source VM.
    2. Offline Sync: The replica instance vmdk files are consolidated (deleted). This is a time-consuming process whose duration depends on the target vCenter Server infrastructure; it cannot be predicted and is mostly unrelated to the HCX bulk migration workflow.
      For more information, please see HCX - Bulk migration task: "Offline sync started on source VM" is taking longer than expected for some VMs.
    3. Instantiate VM: After successful completion of the offline sync, the VMX/VMXF and NVRAM files are copied from the HBR config files to the target datastore and used to instantiate the VM.

  8. Clean up: Upon successful instantiation of the VM on the target side, the migration workflow transitions into a cleanup workflow that removes all instances and configurations corresponding to the migration and transfer of the given VM.
    1. The LWD config is removed from the source/target IX appliances.
    2. Replication is disabled on the source ESX host for the given virtual machine.
    3. The network is disconnected and the VM is renamed for backup on the source side.
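
For orientation, the listing below is a minimal sketch of what the migrating VM's folder on the target datastore may contain during these stages, based on the file names that appear in the hbrsrv.log excerpts later in this article. The datastore path, folder name, and RDID values are placeholders and will differ in every environment.

    ls /vmfs/volumes/<datastore>/<VM_folder>/
    VM_NAME.vmdk                       <- placeholder disk created by HCX (step 1)
    hbrdisk.RDID-########.vmdk         <- replica instance vmdk created during RPO cycles (step 5)
    hbrdisk.RDID-########.vmx.137      <- config files written by HBR and used to instantiate the VM (step 7)
    hbrdisk.RDID-########.vmxf.138
    hbrdisk.RDID-########.nvram.139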

 

Log analysis during bulk migration workflow

  1. HCX: Monitor the events in the following locations:
    1. Go to the HCX source/target migration wizard >> Migration Management page (Mobility Groups) >> Migration >> Events





    2. Go to the HCX source/target admin shell >> /common/log/admin/app.log
    3. Go to the HCX IX appliance shell using ccli >> /var/log/vmware/hbrsrv.log

      Note: You can also go to /tmp/Fleet-appliances/<Service-Mesh>/<IX-Appliance>/var/log/vmware/hbrsrv.log from the HCX tech bundle.
      Note: For forward migration, look for hbrsrv.log from the target/cloud IX appliance. A sketch for filtering hbrsrv.log for the key events below follows this section.

      • Empty disks are created on the target:
        2022-02-17T15:16:18.152Z info hbrsrv[6AB2AD852700] [Originator@6876 sub=Host opID=hs-285dd448] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/VM_NAME
        2022-02-17T15:16:18.202Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Host opID=hs-4ac4bf6f] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/VM_NAME
      • Got the Disk RDID from the source ESX HBR:
        2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta] Configured disks for group VRID-######:
        2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta] RDID-######
        2022-02-17T15:17:18.919Z info hbrsrv[6AB2AD956700] [Originator@6876 sub=Delta] RDID-######
      • Indication of Full Sync completion:
        2022-02-17T15:17:33.881Z info hbrsrv[6AB2ADA9B700] [Originator@6876 sub=Delta opID=hsl-10579a55] Full sync complete for disk RDID-####### (198057984 bytes transferred, 209715200 bytes checksummed)
        2022-02-17T15:17:55.078Z info hbrsrv[6AB2AD8D4700] [Originator@6876 sub=Delta opID=hsl-1057c4a8] Full sync complete for disk RDID-####### (827564032 bytes transferred, 838860800 bytes checksummed)
      • Image creation
        2022-02-17T15:20:04.403Z info hbrsrv[6AB2AD9D8700] [Originator@6876 sub=Delta opID=hsl-1057c4bc] Instance complete for disk RDID-#######
        2022-02-17T15:20:04.738Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=Delta opID=hsl-1057c4ee] Instance complete for disk RDID-#######
        2022-02-17T15:20:14.508Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Creating image from group VRID-########, instance 49, in #######
      • Creation of replica disks
        2022-02-17T15:20:14.526Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Host opID=hs-4f3c1b62:hs-d5da:hs-4252] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk
        2022-02-17T15:20:14.822Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Host opID=hs-4f3c1b62:hs-d5da:hs-4252] Getting disk type for /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk
      • VMX/VMXF and NVRAM file download event
        2022-02-17T15:20:15.123Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdkvmx.137 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk.vmx
        2022-02-17T15:20:15.410Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk.vmxf.138 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk.vmxf
        2022-02-17T15:20:15.430Z info hbrsrv[6AB2ADBE0700] [Originator@6876 sub=Image opID=hs-4f3c1b62:hs-d5da:hs-4252] Copying cfg /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk.nvram.139 to /vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk.nvram
      • Replica disks consolidation/deletion
        2022-02-17T15:20:32.891Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The disk '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk' (key=186) was cleaned up successfully.
        2022-02-17T15:20:33.004Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The disk '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmdk' (key=187) was cleaned up successfully.
      • Hbrcfg file deletion
        2022-02-17T15:20:33.148Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmx.137' (key=189) was cleaned up successfully.
        2022-02-17T15:20:33.220Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.vmxf.138' (key=190) was cleaned up successfully.
        2022-02-17T15:20:33.291Z info hbrsrv[6AB2AD70D700] [Originator@6876 sub=PersistentCleanup opID=hs-565f4eb] The file '/vmfs/volumes/vsan:UUID/VM_UUID/hbrdisk.RDID-########.nvram.139' (key=191) was cleaned up successfully.


  2. ESX HBR: Run vim-cmd commands on the source ESXi host to verify the replication status.

    • Get all VM details and track specific VM ID
      [root@cia-vmc-esx-015:~] vim-cmd vmsvc/getallvms
      852    VM_Name                                                                                 rhel6_64Guest           vmx-14      
    • Get replication state for VM ID 852 from source ESX/HBR
      vim-cmd hbrsvc/vmreplica.getState 852
      Retrieve VM running replication state:
      (vim.fault.ReplicationVmFault) {
         faultCause = (vmodl.MethodFault) null, 
         faultMessage = <unset>, 
         reason = "notConfigured", 
         state = <unset>, 
         instanceId = <unset>, 
         vm = 'vim.VirtualMachine:852'
         msg = "Received SOAP response fault from [<cs p:000000081ef64380, TCP:localhost:8307>]: getGroupState
      vSphere Replication operation error: Virtual machine is not configured for replication."
      }
      vim-cmd hbrsvc/vmreplica.getState 852
      Retrieve VM running replication state:
      	The VM is configured for replication. Current replication state: Group: VRID-##### (generation=32459820918756983)
      	Group State: full sync (74% done: checksummed 614 MB of 1000 MB, transferred 569.3 MB of 593.8 MB)
      		DiskID RDID-###### State: full sync (checksummed 414 MB of 800 MB, transferred 380.4 MB of 404.9 MB)
      		DiskID RDID-###### State: inactive
      vim-cmd hbrsvc/vmreplica.getState 852
      Retrieve VM running replication state:
      	The VM is configured for replication. Current replication state: Group: VRID-##### (generation=32459820918756983)
      	Group State: lwd delta (instanceId=replica-########) (0% done: transferred 0 bytes of 40 KB)
      		DiskID RDID-####### State: lwd delta (transferred 0 bytes of 40 KB)
      		DiskID RDID-####### State: lwd delta (transferred 0 bytes of 0 bytes)
  3. Target vCenter Server: Check the VM datastore file location and verify the status of the hbrdisk and hbrcfg files.

    • Placeholder disks are created by HCX.
    • Config files (nvram, vmx, and vmxf) are created by HBR.
    • There may be multiple image instances created on the target datastore, depending on the snapshots on the source VM, but those are downloaded and consolidated during VM instantiation.
    • After successful consolidation of the disks, only the original images and vmdk files are retained on the target datastore.
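
As a convenience, the command below is a minimal sketch of one way to filter hbrsrv.log for the milestone events listed above. It assumes a standard grep is available in the IX appliance shell (or wherever the log or tech bundle has been copied); adjust the path as needed.

    grep -E "Full sync complete|Instance complete|Creating image|cleaned up successfully" /var/log/vmware/hbrsrv.log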



Environment

VMware HCX

Resolution

Migration Best Practices

  • Use Bulk Migration with "Seed Checkpoint" enabled, which is available from the HCX 4.1.0 release onwards.

Note: In the event that the bulk migration workflow fails and rolls back at the cutover stage, the workflow will try to reuse the seed data already copied in the previous attempt when the migrations are rescheduled.
Note: Do not perform a cleanup operation on a failed job, as doing so removes the seed data.

  • Quiesce the VM before scheduling the migration to minimize data churn.
  • Migrate the VM by itself so that all infrastructure resources are dedicated to that single workflow.
  • Ensure there is sufficient space in the target datastore. Up to 20% extra space may be used temporarily during the migration (see the sketch after the note below).
  • Follow the migration events and estimates in the HCX UI to identify any slowness caused by the infrastructure or the network.
  • Additionally, the vSphere Replication status can be monitored from the source ESXi host as described in the previous sections.
  • If a source ESXi host is heavily loaded in terms of memory or I/O, replication performance will be affected. As a result, the bulk migration workflow may take more time to complete the initial base sync even when there is no slowness in the underlying datapath (see the sketch after the note below).

Note: In such cases, the recommendation is to relocate the source VM's compute resources to another, less busy ESXi host using vCenter vMotion. This action does not impact ongoing replication and does not require any changes to the migration workflow.
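
The commands below are a minimal sketch of how these two checks could be performed, using standard ESXi shell tools rather than anything HCX-specific; the datastore path is a placeholder.

    # Check free capacity of the target datastore (placeholder path):
    df -h /vmfs/volumes/<target_datastore>
    # Observe CPU, memory, and storage load on the source ESXi host interactively:
    esxtop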

  • The switchover window should be over-estimated to accommodate the lengthy data checksum process and the instantiation of the VM on the target.
  • Co-location with other VMs must be planned accordingly to accommodate the expected downtime in services, so that the migration workflow can be committed to completion.
  • In certain cases, the bulk migration workflow may take more time to complete the switchover stage, for example, when an extremely large VM is migrated using HCX.
  • Do not restart the app/web engine on the source or target HCX Manager during an ongoing migration, as it may impact migration workflows in progress at that point in time.
  • Do not power off the source VM manually after completion of the initial base sync, as it may impact the offline sync workflow.
  • If the VM takes a long time to shut down, or cannot be shut down gracefully from the guest OS, the recommendation is to enable "Force Power Off" when scheduling the migrations. Refer to HCX - Bulk Migration may fail due to "Invalid Power State" of VM.

Alternatively:

  • The VM can be migrated using Cold migration to ensure completion, despite the service disruption.
  • If the bulk migration fails, the recommendation is to use DR instead, ensuring that protection has completed before manually triggering recovery to bring up the VM instance on the target site.
  • DR Protection Recovery can be used only as a last resort, to fail over all required VMs once they are protected.

Note: DR Protection Recovery is a more manual and lengthy process, but it has a higher chance of success given infrastructure and network limitations.

IMPORTANT: A migration cannot be guaranteed under ANY circumstances; therefore, these and other considerations must be taken into account to maximize the chances of a successful migration by minimizing the impact of infrastructure and network limitations.