HCX - RAV/vMotion migration failure for busy VMs - "The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth"

Article ID: 368043

Products

VMware HCX

Issue/Introduction

HCX Replication Assisted vMotion (RAV) migrations fail for busy VMs with the following error: "The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available."




In the HCX Connector logs, search for the unique migration ID (migId) to locate the error:


/common/logs/admin/app.log 

[YYYY-MM-DDTHH:MM:SS] UTC [RAVService_SvcThread-17219, Ent: HybridityAdmin, , TxId: ########-####-####-####-########5bf5] ERROR c.v.h.s.rav.jobs.RAVSwitchoverJob- [migId=########-####-####-####-########32d5] Error while executing RAVSwitchoverJob state 'VERIFY_RESULT'.
com.vmware.vchs.hybridity.migration.common.MigrationException: vMotion failed. System Error. Source side error is : Source side relocate failed for the virtual machine. The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available. msg.checkpoint.precopyfailure.noforwardprogress:The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available.
Target side error is : A general system error occurred: vMotion failed: unknown error msg.migrate.waitdata.platform:Failed waiting for data. Error 195887167. Connection closed by remote host, possibly due to timeout. vob.vmotion.recv.connection.closed:vMotion migration [177482127:2044522200352186763] the remote host closed the connection unexpectedly and migration has stopped. The closed connection probably results from a migration failure detected on the remote host. faultTime:2024-05-03T22:56:38.432535Z
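The search by migration ID can be scripted. The Python sketch below is a hypothetical helper, not an HCX tool: `read_log` and `find_migration_errors` are illustrative names, and the code assumes that the ERROR line contains the literal text `migId=<id>` and the level ` ERROR `, as in the excerpt above. Because the exception details (source and target side errors) are logged on the lines that follow the ERROR line, each match is returned with a few lines of trailing context, similar in spirit to `grep -A`.

```python
from typing import List


def read_log(path: str) -> List[str]:
    """Read a log file into a list of lines, tolerating non-UTF-8 bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


def find_migration_errors(lines: List[str], mig_id: str, context: int = 3) -> List[List[str]]:
    """Return ERROR entries tagged with the given migration ID.

    Each result holds the matching ERROR line plus up to `context` following
    lines, because the exception text is logged after the ERROR line.
    Assumes the "migId=<id>" tag and " ERROR " level format shown in app.log.
    """
    tag = f"migId={mig_id}"
    blocks = []
    for index, line in enumerate(lines):
        if tag in line and " ERROR " in line:
            blocks.append(lines[index : index + 1 + context])
    return blocks
```

For example, `find_migration_errors(read_log("/common/logs/admin/app.log"), "<migration-id>")` returns the ERROR blocks for one migration, where `<migration-id>` is the unmasked ID from your own environment.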

 

In the vmware.log of the VM:

[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: MigrateSetState: Transitioning from state 3 to 6.
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| A100: ConfigDB: Setting config.readOnly = "FALSE"
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: Migrate_SetFailureMsgList: switching to new log file.
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: Migrate_SetFailureMsgList: Now in new log file.
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: Migrate: Caching migration error message list:
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: [msg.checkpoint.precopyfailure.noforwardprogress] The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available.
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: Migrate: cleaning up migration state.
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: SVMotion: Enter Phase 12
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: SVMotion_Cleanup: Scheduling cleanup thread.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| I125: SVMotionThreadCompleteMigration: Pausing SvMotion thread pending checkpoint save success...
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: Msg_Post: Error
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: [msg.checkpoint.precopyfailure.noforwardprogress] The migration was canceled because the amount of changing memory for the virtual machine was greater than the available network bandwidth. Attempt the migration again when the virtual machine is not as busy or more network bandwidth is available.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| I125: SVMotionThreadCompleteMigration: Save callback received. Resuming SvMotion failure activities.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| I125: SVMotionThreadCompleteMigration: Preparing to close disks...
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: ----------------------------------------
[YYYY-MM-DDTHH:MM:SS]| worker-51784311| I125: SVMotionCleanupThread: Waiting for SVMotion Bitmap thread to complete.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| W115: SVMotionThreadCompleteMigration: Failed while informing vmkernel of disk close: Not found
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| W115: Migrate: secondary failure during migration: error Migration failed due to lack of pre-copy forward progress.
[YYYY-MM-DDTHH:MM:SS]| worker-51784311| I125: SVMotionCleanupThread: Waiting for SVMotion thread to complete.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| I125: SVMotionCopyThread: Waiting for SVMotion Bitmap thread to complete before issuing a stun during migration failure cleanup.
[YYYY-MM-DDTHH:MM:SS]| worker-51785152| I125: SVMotion: FailureCleanup thread completes.
[YYYY-MM-DDTHH:MM:SS]| worker-51784311| I125: SVMotionCleanupThread: Waiting for final stun/unstun to finish
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: SVMotion: Worker thread performing SVMotionCopyThreadDone exited.
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: MigrateSetStateFinished: type=1 new state=6
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: MigrateSetState: Transitioning from state 6 to 6.
[YYYY-MM-DDTHH:MM:SS]| vmx| A100: ConfigDB: Setting config.readOnly = "FALSE"
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: Migrate_SetFailureMsgList: switching to new log file.
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: Migrate_SetFailureMsgList: Now in new log file.
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: Migrate: Caching migration error message list:
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: [msg.svmotion.fail.internal] A fatal internal error occurred. See the virtual machine's log for more details.
[YYYY-MM-DDTHH:MM:SS]| vmx| I125: Migrate: cleaning up migration state.

[YYYY-MM-DDTHH:MM:SS]| vcpu-0| I125: VMMon_VSCSIStopVports: No such target on adapter
[YYYY-MM-DDTHH:MM:SS]| vcpu-0| W115: Mirror_DisconnectMirrorNode: Some guest IOs failed for device ########-######ffffffffff-svmmirror during disk copy: I/O error. Failing storage vMotion.
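A long vmware.log is easier to read when the distinct `msg.*` keys are listed in the order they were first logged. In the excerpt above, the first key, `msg.checkpoint.precopyfailure.noforwardprogress`, matches the pre-copy failure reported in app.log, while `msg.svmotion.fail.internal` is logged afterward, during cleanup. The Python sketch below is a minimal illustration; `failure_messages` is a hypothetical helper, and it assumes the keys appear in square brackets as shown in the excerpt.

```python
import re
from typing import List, Tuple

# Matches bracketed message keys such as
# "[msg.checkpoint.precopyfailure.noforwardprogress] The migration was ..."
MSG_KEY = re.compile(r"\[(msg\.[\w.]+)\]\s*(.*)")


def failure_messages(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (key, text) for each distinct msg.* key, in order of first appearance."""
    seen = set()
    results = []
    for line in lines:
        match = MSG_KEY.search(line)
        if match and match.group(1) not in seen:
            seen.add(match.group(1))
            results.append((match.group(1), match.group(2)))
    return results
```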

 

ESXi - vmkernel.log 

[YYYY-MM-DDTHH:MM:SS] cpu21:51784291)VMotion: 5321: 8783550563599717932 S: Stopping pre-copy: not enough forward progress (Pages left to send: prev2 27749386, prev 22649387, cur 18638868, pages dirtied by pass through device 0 network bandwidth ~12.8$
[YYYY-MM-DDTHH:MM:SS] cpu21:51784291)WARNING: VMotion: 5345: 8783550563599717932 S: Canceling VMotion: sending remaining pages will take 6237.734 seconds, which is greater than the maximum allowable swithover time of 100 seconds.
[YYYY-MM-DDTHH:MM:SS] cpu21:51784291)WARNING: Migrate: 282: 8783550563599717932 S: Failed: Migration failed due to lack of pre-copy forward progress (0xbad0109) @0x41801eb77d5d
[YYYY-MM-DDTHH:MM:SS] cpu36:51785155)WARNING: Migrate: 6145: 8783550563599717932 S: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
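The first two vmkernel.log lines carry the numbers behind the failure: the pages left to send dropped by about 5.1 million and then by about 4.0 million pages between pre-copy passes, and the estimated time to send the remaining 18,638,868 pages (about 6238 seconds) was roughly 62 times the 100-second limit. The Python sketch below extracts these values; `precopy_summary` is a hypothetical helper, and its regular expressions assume the message format shown above, including ESXi's "swithover" spelling (either spelling is accepted).

```python
import re
from typing import Iterable

PROGRESS = re.compile(r"Pages left to send: prev2 (\d+), prev (\d+), cur (\d+)")
# The ESXi message spells "swithover"; sw\w*over accepts either spelling.
CANCEL = re.compile(
    r"will take ([\d.]+) seconds, which is greater than the maximum "
    r"allowable sw\w*over time of ([\d.]+) seconds"
)


def precopy_summary(lines: Iterable[str]) -> dict:
    """Collect pre-copy progress and the switchover estimate from vmkernel.log lines."""
    summary: dict = {}
    for line in lines:
        progress = PROGRESS.search(line)
        if progress:
            prev2, prev, cur = (int(value) for value in progress.groups())
            # Page reduction achieved by each of the last two passes.
            summary.update(pages_left=cur, drop_before=prev2 - prev, drop_last=prev - cur)
        cancel = CANCEL.search(line)
        if cancel:
            estimate, limit = (float(value) for value in cancel.groups())
            summary.update(estimate_s=estimate, limit_s=limit, over_limit_x=estimate / limit)
    return summary
```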

Environment

HCX

Cause

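The final stage (switchover) of a RAV migration relies on vMotion. During this step, vMotion iteratively pre-copies the VM's memory to the target site, then stuns the VM momentarily to copy the remaining changed memory and complete the migration.

When the VM is busy, it can change memory faster than the changes can be sent over the available network bandwidth, so pre-copy stops making enough forward progress. If the estimated time to send the remaining pages exceeds the maximum allowable switchover time (100 seconds in the vmkernel.log excerpt above), vMotion cancels the migration and it is marked as a failure.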

This is the same behavior as for a standard vMotion; see https://knowledge.broadcom.com/external/article/332734/
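The relationship between memory change rate, bandwidth, and forward progress can be illustrated with an idealized pre-copy model. This Python sketch is not the ESXi algorithm: `simulate_precopy`, its rates in pages per second, and the `max_passes` cap are all hypothetical, and only the 100-second switchover limit comes from the log above. Each pass sends the current set of changed pages, and the pages dirtied while that pass runs become the next set, so the set shrinks only when the dirty rate is below the send rate.

```python
from typing import List, Tuple


def simulate_precopy(total_pages: int, send_pages_per_s: float, dirty_pages_per_s: float,
                     max_switchover_s: float = 100.0, max_passes: int = 20) -> Tuple[bool, List[float]]:
    """Idealized iterative pre-copy (illustration only, not the ESXi implementation).

    Returns (converged, pages_left_before_each_pass).
    """
    remaining = float(total_pages)
    history = [remaining]
    for _ in range(max_passes):
        pass_s = remaining / send_pages_per_s
        if pass_s <= max_switchover_s:
            return True, history  # the remainder fits in the switchover window
        # Pages re-dirtied while this pass was sent; capped at the VM's memory size.
        next_remaining = min(float(total_pages), dirty_pages_per_s * pass_s)
        if next_remaining >= remaining:
            return False, history  # no forward progress at all
        remaining = next_remaining
        history.append(remaining)
    return False, history  # progress too slow to converge within max_passes
```

With 10,000,000 pages, a send rate of 10,000 pages/s, and a dirty rate of 2,000 pages/s, the model converges after two passes; with a dirty rate of 12,000 pages/s it makes no progress, and with 9,900 pages/s it shrinks so slowly that it does not converge within 20 passes, which resembles the slow decrease seen in the vmkernel.log excerpt.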


Resolution

  • Migrate the VM during a less busy time window, or when more network bandwidth is available.
  • If a dedicated vMotion vNIC is configured in the HCX Network Profile on the Connector side, ensure that vMotion traffic is not congested or de-prioritized because of shared links.
  • Verify that network performance between the sites (throughput, latency, and packet loss) is adequate for the VM's memory change rate.
  • When possible, stop the application before the scheduled switchover.
  • Use a different migration type, such as Bulk Migration, to migrate the VM.
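To judge whether more bandwidth could realistically help, the pages left to send can be converted into the bandwidth needed to send them within the switchover limit. This Python sketch assumes 4 KiB memory pages (an assumption, not stated in the logs) and the 100-second limit from the vmkernel.log excerpt; `min_bandwidth_gbps` is a hypothetical helper. It ignores pages the VM keeps dirtying during the transfer, so it is a lower bound.

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB guest memory pages


def min_bandwidth_gbps(pages_left: int, max_switchover_s: float = 100.0,
                       page_size: int = PAGE_SIZE_BYTES) -> float:
    """Bandwidth in Gbit/s needed to send pages_left within max_switchover_s.

    Lower bound only: memory dirtied during the transfer is not counted.
    """
    bits = pages_left * page_size * 8
    return bits / max_switchover_s / 1e9
```

For the 18,638,868 pages left in the excerpt, `min_bandwidth_gbps(18_638_868)` is about 6.1 Gbit/s, which suggests why rescheduling the migration or reducing the VM's load may be more practical than adding bandwidth for a VM this busy.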

Additional Information

In some cases, the migration fails this way for every VM, not only busy ones. When that happens, investigate the network between the sites further, in particular network jitter.
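One simple way to quantify jitter is the mean absolute difference between consecutive round-trip-time samples collected between the sites (for example, from ping output). This Python sketch is illustrative only: `mean_jitter_ms` is a hypothetical helper and the sample values in the usage note are made up. RFC 3550 defines a smoothed variant of the same idea based on one-way transit times.

```python
from statistics import mean
from typing import Sequence


def mean_jitter_ms(rtt_ms: Sequence[float]) -> float:
    """Mean absolute difference between consecutive round-trip times, in ms."""
    if len(rtt_ms) < 2:
        raise ValueError("need at least two RTT samples")
    return mean(abs(b - a) for a, b in zip(rtt_ms, rtt_ms[1:]))
```

For illustrative samples of 10, 12, 11, and 15 ms, the consecutive differences are 2, 1, and 4 ms, giving a mean jitter of about 2.33 ms.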