Shared nothing vMotion and cross vCenter server vMotion fails for large VMs


Article ID: 344928


Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

  • Shared nothing vMotion (migration of both compute and storage) fails for very large VMs if the migration takes more than 24 hours to complete.
  • You see messages similar to the following in the vmware.log for the affected VM:
    YYYY-MM-DDTHH:MM:SS.175Z| vmx| I125: MigrateVMXdrToSpec: type: 1 srcIp=<##.##.##.##> dstIp=<##.##.##.##> mid=1f17d2f78ec07468 uuid=########-####-####-####-########176b priority=yes checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=yes diskOpFile= srcLogIp=<<unknown>> dstLogIp=<<unknown>> ftPrimaryIp=<<unknown>> ftSecondaryIp=<<unknown>>
    YYYY-MM-DDTHH:MM:SS.176Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<##.##.##.##> dstIp=<##.##.##.##> mid=2240491300333843560 uuid=########-####-####-####-########176b priority=high
    ..
    YYYY-MM-DDTHH:MM:SS.868Z| vcpu-0| I125: MigratePlatformRestoreVnicBackingChangeOnFailure: RestoreVnicBacking-vnicBackingChange: vNicIndex 0 switchUuid 50 18 2a b7 ## ## ## ##-## ## ## ## 17 e1 b8 f8 portKey
    YYYY-MM-DDTHH:MM:SS.868Z| vmx| I125: [msg.migrate.waitdata.platform] Failed waiting for data. Error bad0003. Not found.
    YYYY-MM-DDTHH:MM:SS.868Z| vmx| I125: [vob.vmotion.dvs.state.restore.failed] vMotion migration [a27340b:2240491300333843560] failed to get DVS state in the restore phase from the source host <##.##.##.##>
    ..
    YYYY-MM-DDTHH:MM:SS.885Z| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
    YYYY-MM-DDTHH:MM:SS.885Z| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
    ..
    YYYY-MM-DDTHH:MM:SS.895Z| vcpu-0| I125: [msg.checkpoint.precopyfailure] Migration to host <##.##.##.##> failed with error Connection reset by peer (0xbad004b).
    ..
    YYYY-MM-DDTHH:MM:SS.895Z| vcpu-0| I125: [vob.migrate.net.xfer.recvfailed.status] The migration transfer failed during the receive operation to socket 4311686C4AE0: received 0/36 bytes: Connection reset by peer.
    YYYY-MM-DDTHH:MM:SS.895Z| vcpu-0| I125: [vob.vmotion.stream.keepalive.read.fail] vMotion migration [a27340b:2240491300333843560] failed to read stream keepalive: Connection reset by peer

     

  • You see messages similar to the following in the vmkernel.log on the source ESXi host:

YYYY-MM-DDTHH:MM:SS.756Z cpu14:5749041)VMotion: 5417: 2240491300333843560 S: Estimated network bandwidth 129.844 MB/s during disk copy.
YYYY-MM-DDTHH:MM:SS.867Z cpu73:6022824)WARNING: VMotionUtil: 862: 2240491300333843560 S: failed to read stream keepalive: Connection reset by peer
YYYY-MM-DDTHH:MM:SS.868Z cpu73:6022824)WARNING: Migrate: 282: 2240491300333843560 S: Failed: Connection reset by peer (0xbad004b) @0x418022f0f9f3
YYYY-MM-DDTHH:MM:SS.895Z cpu70:5749046)WARNING: Migrate: 6145: 2240491300333843560 S: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

  • You see messages similar to the following in the vmkernel.log on the destination ESXi host:

YYYY-MM-DDTHH:MM:SS.836Z cpu0:2242333)VMotionRecv: 693: 2240491300333843560 D: Estimated network bandwidth 129.855 MB/s during disk copy.
YYYY-MM-DDTHH:MM:SS.837Z cpu20:2242332)WARNING: VMotionSend: 3618: 2240491300333843560 D: failed to get DVS state in the restore phase from the source host <##.##.##.##>
YYYY-MM-DDTHH:MM:SS.837Z cpu20:2242332)WARNING: VMotionSend: 5923: 2240491300333843560 D: Failed handling message reply GET_DVS_STATE: Not found
YYYY-MM-DDTHH:MM:SS.837Z cpu20:2242332)WARNING: Migrate: 282: 2240491300333843560 D: Failed: Not found (0xbad0003) @0x41801c6c4bb2
YYYY-MM-DDTHH:MM:SS.868Z cpu64:2242306)WARNING: Migrate: 6145: 2240491300333843560 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
YYYY-MM-DDTHH:MM:SS.868Z cpu64:2242306)WARNING: VMotion: 565: 2240491300333843560 D: Storage stream IO error: 458752

  • A similar issue can occur during Cross vCenter vMotion (XVM) of a large VM, which fails with the error: Invalid configuration for device 'X'

Destination vCenter Server vpxd log:

YYYY-MM-DDTHH:MM:SS.836Z error vpxd[####] [Originator@6876 sub=VmProv opID=#######] Exception while executing action vpx.vmprov.CreateDestinationVm: N3Vim5Fault17InvalidDeviceSpec9ExceptionE(Fault cause: vim.fault.InvalidDeviceSpec)

Destination host hostd log:

YYYY-MM-DDTHH:MM:SS.836Z Er(163) Hostd####]: [Originator@6876 sub=######.vmx opID=#######
##### sid=####user=vpxuser:<no user>] Device spec doesn't match up with dvport/dvpg configuration

Source vCenter Server vpxd log:

YYYY-MM-DDTHH:MM:SS.836Z error vpxd[####] [Originator@6876 sub=VmProv opID=#######] Get exception while executing action vpx.vmprov.CreateDestinationVm:
            --> (vim.fault.InvalidDeviceSpec) {
            -->    property = "virtualDeviceSpec.device.backing",
            -->    deviceIndex = 118,
            -->    msg = "Invalid configuration for device 'XX'.",
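The log excerpts above share a few distinctive tokens, so a quick search can show whether a given log contains this failure pattern. The following is a minimal sketch, not a VMware tool: the function name find_port_expiry_signatures is hypothetical, the token list is taken only from the messages quoted in this article, and it assumes a grep that supports the -o and -E options.

```shell
# Hypothetical helper: list which of the failure signatures quoted in this
# article appear in the given log file (vmware.log, vmkernel.log, vpxd log).
find_port_expiry_signatures() {
    log_file="$1"
    # -o prints only the matched token; sort -u collapses repeated matches.
    grep -oE 'vob\.vmotion\.dvs\.state\.restore\.failed|vob\.vmotion\.stream\.keepalive\.read\.fail|GET_DVS_STATE|vim\.fault\.InvalidDeviceSpec' "$log_file" | sort -u
}
```

For example, `find_port_expiry_signatures vmware.log` prints one line per signature found and prints nothing if none of them are present. A match only indicates that the log resembles the excerpts above; confirm the migration duration before concluding that the port reservation expired.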

 

Environment

VMware vCenter Server 6.5.x
VMware vCenter Server 6.7.x
VMware vCenter Server 7.0.x
VMware vCenter Server 8.0.x
VMware vSphere ESXi 6.5.x
VMware vSphere ESXi 6.7.x
VMware vSphere ESXi 7.0.x
VMware vSphere ESXi 8.0.x

Cause

  • Failure of such migrations is likely due to a missing port on the destination ESXi host.
  • For every migration, vCenter Server reserves a distributed virtual switch (DVS) port for 24 hours. When the reservation expires, DvsMonitor deletes the port. If the migration is still running at that point, the VM has no port to connect to on the destination host and the migration fails.
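Because the failure depends on the migration outlasting the 24-hour reservation, a rough duration estimate can indicate in advance whether a VM is at risk. The sketch below is an approximation under stated assumptions: the helper names are hypothetical, the bandwidth is treated as constant (for example the "Estimated network bandwidth" value reported in vmkernel.log), and 1 TB is taken as 1024 * 1024 MB.

```shell
# Hypothetical helper: hours needed to copy size_tb terabytes at a constant
# bandwidth_mb_s megabytes per second (assumes 1 TB = 1024 * 1024 MB).
estimate_migration_hours() {
    awk -v tb="$1" -v bw="$2" 'BEGIN { printf "%.1f\n", tb * 1024 * 1024 / bw / 3600 }'
}

# Succeeds (exit status 0) when the estimate is longer than the 24-hour
# default port reservation described above.
exceeds_default_reservation() {
    awk -v h="$(estimate_migration_hours "$1" "$2")" 'BEGIN { exit !(h > 24) }'
}
```

For example, `estimate_migration_hours 50 129.844` prints 112.2, so a 50 TB VM copied at the bandwidth shown in the log excerpt would run well past the 24-hour reservation. Real migrations vary in throughput, so treat the result as a lower bound rather than a prediction.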

Resolution

This is default and expected behavior.

To work around this behavior, extend the port reservation timeout so that the migration has time to complete:

  1. Connect to the vCenter Server appliance using SSH.
  2. Make a backup copy of the /etc/vmware-vpx/vpxd.cfg file:

cp /etc/vmware-vpx/vpxd.cfg /etc/vmware-vpx/vpxd.cfg.bak

  3. Add the following stanza to the file:

<vpxd>
  <dvs>
    <PortReserveTimeoutInMin>7200</PortReserveTimeoutInMin>
  </dvs>
</vpxd>

Note: This extends the port reservation timeout to 5 days (5 * 24 * 60 = 7200 minutes), which should be enough to cover a slow shared-nothing vMotion of a very large VM (50 TB or more).

  4. Restart the vpxd service:

service-control --restart vmware-vpxd

  5. Retry the vMotion operation on the affected VMs.

  6. Once the VMs have been migrated, restore the original vpxd.cfg file from the backup:

cp /etc/vmware-vpx/vpxd.cfg.bak /etc/vmware-vpx/vpxd.cfg

  7. Restart the vpxd service again so that the restored configuration takes effect:

service-control --restart vmware-vpxd
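The timeout value and the state of the configuration file can be checked with two small helpers. This is a minimal sketch under stated assumptions: both function names are hypothetical, the value is assumed to sit on a single line as in the stanza above, and the check only reads the file without modifying it.

```shell
# Hypothetical helper: convert a reservation window in days to the minutes
# value expected by PortReserveTimeoutInMin.
timeout_minutes_for_days() {
    echo $(( $1 * 24 * 60 ))
}

# Hypothetical helper: print the PortReserveTimeoutInMin value found in a
# vpxd.cfg-style file, or print nothing if the setting is absent.
read_port_reserve_timeout() {
    sed -n 's:.*<PortReserveTimeoutInMin>\([0-9]*\)</PortReserveTimeoutInMin>.*:\1:p' "$1"
}
```

For example, `timeout_minutes_for_days 5` prints 7200, matching the value in the stanza. After editing the file, `read_port_reserve_timeout /etc/vmware-vpx/vpxd.cfg` should print 7200. After restoring the backup it should print nothing, assuming the original file did not already contain the setting.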