Shared nothing vMotion fails for large VMs

Products

VMware vCenter Server
VMware vSphere ESXi

Issue/Introduction

Symptoms:
Shared nothing vMotion (migration of both compute and storage) fails for very large VMs if the migration takes more than 24 hours to complete.

The VM's vmware.log may contain entries similar to:
2021-04-22T10:40:52.175Z| vmx| I125: MigrateVMXdrToSpec: type: 1 srcIp=<10.10.10.10> dstIp=<10.10.10.11> mid=1f17d2f78ec07468 uuid=d9e1a354-aead-11e9-bf2c-0a94ef93176b priority=yes checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=yes diskOpFile= srcLogIp=<<unknown>> dstLogIp=<<unknown>> ftPrimaryIp=<<unknown>> ftSecondaryIp=<<unknown>>
2021-04-22T10:40:52.176Z| vmx| I125: MigrateSetInfo: state=8 srcIp=<10.10.10.10> dstIp=<10.10.10.11> mid=2240491300333843560 uuid=d9e1a354-aead-11e9-bf2c-0a94ef93176b priority=high
..
2021-04-23T16:35:10.868Z| vcpu-0| I125: MigratePlatformRestoreVnicBackingChangeOnFailure: RestoreVnicBacking-vnicBackingChange: vNicIndex 0 switchUuid 50 18 2a b7 15 19 16 7b-be 5e 0d 3f 17 e1 b8 f8 portKey
2021-04-23T16:35:10.868Z| vmx| I125: [msg.migrate.waitdata.platform] Failed waiting for data. Error bad0003. Not found.
2021-04-23T16:35:10.868Z| vmx| I125: [vob.vmotion.dvs.state.restore.failed] vMotion migration [a27340b:2240491300333843560] failed to get DVS state in the restore phase from the source host <10.10.10.10>
..
2021-04-23T16:35:10.885Z| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp. Error = 17
2021-04-23T16:35:10.885Z| vcpu-0| I125: FILE: FileCreateDirectoryEx: Failed to create /tmp/vmware-root. Error = 17
..
2021-04-23T16:35:10.895Z| vcpu-0| I125: [msg.checkpoint.precopyfailure] Migration to host <10.10.10.11> failed with error Connection reset by peer (0xbad004b).
..
2021-04-23T16:35:10.895Z| vcpu-0| I125: [vob.migrate.net.xfer.recvfailed.status] The migration transfer failed during the receive operation to socket 4311686C4AE0: received 0/36 bytes: Connection reset by peer.
2021-04-23T16:35:10.895Z| vcpu-0| I125: [vob.vmotion.stream.keepalive.read.fail] vMotion migration [a27340b:2240491300333843560] failed to read stream keepalive: Connection reset by peer

vmkernel.log on the source ESXi host:
2021-04-23T16:35:10.756Z cpu14:5749041)VMotion: 5417: 2240491300333843560 S: Estimated network bandwidth 129.844 MB/s during disk copy.
2021-04-23T16:35:10.867Z cpu73:6022824)WARNING: VMotionUtil: 862: 2240491300333843560 S: failed to read stream keepalive: Connection reset by peer
2021-04-23T16:35:10.868Z cpu73:6022824)WARNING: Migrate: 282: 2240491300333843560 S: Failed: Connection reset by peer (0xbad004b) @0x418022f0f9f3
2021-04-23T16:35:10.895Z cpu70:5749046)WARNING: Migrate: 6145: 2240491300333843560 S: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

vmkernel.log on the destination ESXi host:
2021-04-23T16:35:10.836Z cpu0:2242333)VMotionRecv: 693: 2240491300333843560 D: Estimated network bandwidth 129.855 MB/s during disk copy.
2021-04-23T16:35:10.837Z cpu20:2242332)WARNING: VMotionSend: 3618: 2240491300333843560 D: failed to get DVS state in the restore phase from the source host <10.10.10.10>
2021-04-23T16:35:10.837Z cpu20:2242332)WARNING: VMotionSend: 5923: 2240491300333843560 D: Failed handling message reply GET_DVS_STATE: Not found
2021-04-23T16:35:10.837Z cpu20:2242332)WARNING: Migrate: 282: 2240491300333843560 D: Failed: Not found (0xbad0003) @0x41801c6c4bb2
2021-04-23T16:35:10.868Z cpu64:2242306)WARNING: Migrate: 6145: 2240491300333843560 D: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.
2021-04-23T16:35:10.868Z cpu64:2242306)WARNING: VMotion: 565: 2240491300333843560 D: Storage stream IO error: 458752
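The signatures above can be searched for mechanically. The following is a minimal sketch, not a VMware tool: the function name vmotion_port_expiry_hits is hypothetical, and the patterns are taken only from the log excerpts in this article. It counts the matching lines in whichever log file is passed to it.

```shell
#!/bin/sh
# Hypothetical helper (not part of ESXi or vCenter): count the lines in a
# log file that match the failure signatures quoted in this article.
# Usage: vmotion_port_expiry_hits LOG_FILE
vmotion_port_expiry_hits() {
    log_file="$1"
    # -c prints the number of matching lines; -E enables alternation with |
    grep -c -E \
        'failed to get DVS state in the restore phase|Failed handling message reply GET_DVS_STATE|failed to read stream keepalive' \
        "$log_file"
}
```

For example, run it against a copy of vmkernel.log from the source or destination host, or against the VM's vmware.log. A non-zero count alone does not prove this cause; also compare the first and last migration timestamps to confirm that more than 24 hours elapsed.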

Environment

VMware vCenter Server 6.5.x
VMware vSphere ESXi 7.0.x
VMware vCenter Server 6.7.x
VMware vCenter Server 7.0.x
VMware ESXi 6.5.x
VMware ESXi 6.7.x

Cause

These migrations fail because the port reserved for the VM on the destination ESXi host no longer exists.
For every migration, vCenter Server reserves a DVS (distributed virtual switch) port for 24 hours. When the reservation expires, DvsMonitor deletes the port, so the migrated VM has no port to connect to on the destination host and the migration fails.
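To judge in advance whether a migration is likely to outlast the 24-hour reservation, the copy time can be estimated from the VM size and the sustained bandwidth (for example, the "Estimated network bandwidth" value reported in vmkernel.log). This is a rough sketch under stated assumptions: the function names are hypothetical, 1 GB is taken as 1024 MB, and the bandwidth is assumed to stay constant for the whole copy.

```shell
#!/bin/sh
# Sketch only: estimate copy time from VM size and sustained bandwidth.
# Assumptions: 1 GB = 1024 MB, constant bandwidth, no other overhead.

# estimate_minutes SIZE_GB BANDWIDTH_MB_PER_S
# Prints the estimated copy time in whole minutes, rounded up.
estimate_minutes() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN {
        m = gb * 1024 / mbps / 60      # seconds -> minutes
        r = int(m); if (r < m) r++     # round up to a whole minute
        print r
    }'
}

# exceeds_reservation SIZE_GB BANDWIDTH_MB_PER_S [TIMEOUT_MIN]
# Exit status 0 if the estimate is longer than the reservation window.
# TIMEOUT_MIN defaults to 1440 (the 24-hour window described above).
exceeds_reservation() {
    timeout_min="${3:-1440}"
    [ "$(estimate_minutes "$1" "$2")" -gt "$timeout_min" ]
}
```

For instance, `exceeds_reservation 51200 129.844 && echo "longer than 24 hours"` checks a 50 TB VM at the bandwidth seen in the logs above; under these assumptions the estimate is roughly 112 hours. Treat the result as a lower bound, since real transfers rarely hold a constant rate.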

Resolution

This is the default, expected behaviour.
To migrate VMs that need more than 24 hours, use the workaround below.

Workaround:
To work around this issue, extend the port reservation timeout so that the migration has time to complete:

1. Connect to the vCenter Server over SSH.
2. Back up the file /etc/vmware-vpx/vpxd.cfg:

cp /etc/vmware-vpx/vpxd.cfg /etc/vmware-vpx/vpxd.cfg.bak

3. Add the following <dvs> section to vpxd.cfg, inside the <vpxd> element:
<vpxd>
<dvs>
<PortReserveTimeoutInMin>7200</PortReserveTimeoutInMin>
</dvs>
<cert>
This extends the port reservation timeout to 5 days (24*60*5 = 7200 minutes), which should be enough to cover a slow or very large (50 TB+) shared-nothing vMotion.
4. Restart the vpxd service:

service-control --restart vmware-vpxd

5. Retry the shared-nothing vMotion of the large VM(s).
6. Once the VMs have been migrated, restore the original vpxd.cfg from the backup:

cp /etc/vmware-vpx/vpxd.cfg.bak /etc/vmware-vpx/vpxd.cfg

7. Restart the vpxd service:

service-control --restart vmware-vpxd
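The value used in step 3 and the state of the file after steps 3 and 6 can be checked with a couple of small helpers. This is a minimal sketch: both function names are hypothetical, and the second one assumes the element sits on a single line, as in the snippet in step 3.

```shell
#!/bin/sh
# Hypothetical helpers for the workaround steps; not VMware-provided tools.

# days_to_minutes DAYS
# Converts whole days to the minutes value used by PortReserveTimeoutInMin.
days_to_minutes() {
    echo $(( $1 * 24 * 60 ))
}

# get_port_reserve_timeout CFG_FILE
# Prints the configured PortReserveTimeoutInMin value, or nothing if the
# element is absent. Assumes the element is written on one line.
get_port_reserve_timeout() {
    sed -n 's:.*<PortReserveTimeoutInMin>\([0-9][0-9]*\)</PortReserveTimeoutInMin>.*:\1:p' "$1"
}
```

Running `get_port_reserve_timeout /etc/vmware-vpx/vpxd.cfg` after step 3 should print the value that was added; after step 6 it should print nothing, provided the backup did not already contain the setting.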