VM migration via Workload Placement failing due to vNIC connection state mismatch and timeouts

Article ID: 418668

Updated On:

Products

VCF Operations/Automation (formerly VMware Aria Suite)
VMware vCenter Server 8.0

Issue/Introduction

Automated Virtual Machine (VM) migrations initiated through Workload Placement (WLP) fail in Aria Operations 8.18.3 environments integrated with vCenter Server 8.0.3. Manual migrations of the same virtual machines complete successfully.
 
This issue occurs after upgrading vCenter Server from 7.0.3 to 8.0.3.
 
Users observe the following errors:
  1. Failed waiting for data. Error 195887107. Not found.
  2. vMotion migration […] failed to read stream keepalive: Connection closed by remote host, possibly due to timeout.
  3. Migration to host […] failed with error Connection closed by remote host, possibly due to timeout (195887167).
  4. vMotion migration […] failed to get DVS state in the restore phase from the source host.
  5. vim.fault.GenericVmConfigFault: vMotion migration [...] failed to get DVS state in the restore phase from the source host.
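
As a triage aid, the messages above can be grouped by the cause they most likely point to. The sketch below is a minimal illustration, not a supported tool: the function name `classify` and the cause labels are hypothetical, and the mapping simply mirrors this article's Cause section (DVS state restore failures point to the vNIC connection state problem, "Connection closed by remote host" points to the timeout problem). Messages that match neither pattern are left unclassified.

```python
import re

# Substrings copied from the error messages listed in this article.
# The cause labels are hypothetical names used only for this sketch.
PATTERNS = [
    (re.compile(r"failed to get DVS state in the restore phase", re.I),
     "vnic_connection_state"),
    (re.compile(r"Connection closed by remote host", re.I),
     "timeout"),
]


def classify(message: str) -> str:
    """Return the likely cause label for a migration error message."""
    for pattern, cause in PATTERNS:
        if pattern.search(message):
            return cause
    # The article does not tie every message to one cause, so do not guess.
    return "unclassified"


if __name__ == "__main__":
    sample = "vMotion migration failed to get DVS state in the restore phase from the source host."
    print(classify(sample))
```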

Environment

Aria Operations 8.18.3
vCenter Server 8.0.3

Cause

  • ESXi Datapath Change: ESXi 8.0 Update 1 (build 21495797) introduced a strict check of the vnicIsConnected field within the vnicBackingChange specification during vMotion operations.
  • Aria Operations Client Behavior: Aria Operations 8.18.3, acting as a vMotion client, does not consistently populate the Connectable object for each VirtualDeviceSpec in the device change list when invoking vim.vm.RelocateVM_Task API calls.
  • Incorrect vnicIsConnected Interpretation: When the client does not explicitly set the Connectable object, the ESXi vmkernel internally defaults vnicIsConnected to false. This misrepresents the vNIC's actual connected state and leads to the vMotion failure during the Distributed Virtual Switch (DVS) state restoration phase, because the destination host receives an erroneous "disconnected" signal.
  • Timeout for Large VMs: For larger VMs (e.g., an ~8 TB VM), the default Workload Placement timeout settings (60 minutes for single resource actions, 90 minutes for bulk actions) are insufficient. The prolonged migration exceeds the timeout, which produces the "Connection closed by remote host" errors.
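
The defaulting behavior described above can be illustrated with a small stand-alone model. This is a simplified sketch, not the vSphere API or the ESXi implementation: the `Connectable` and `VirtualDeviceSpec` classes below are plain Python stand-ins that borrow the names used in this article, and `host_view_of_vnic` and `populate_connectable` are hypothetical helpers. It shows why an omitted Connectable object reads as "disconnected" and what a client-side fix that fills it in would look like.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Connectable:
    """Stand-in for the connection info attached to a virtual device."""
    connected: bool


@dataclass
class VirtualDeviceSpec:
    """Stand-in for one entry in a relocate request's device change list."""
    device_key: int
    # None models a client that did not populate Connectable at all.
    connectable: Optional[Connectable] = None


def host_view_of_vnic(spec: VirtualDeviceSpec) -> bool:
    """Model the described default: a missing Connectable reads as False."""
    if spec.connectable is None:
        return False
    return spec.connectable.connected


def populate_connectable(
    specs: List[VirtualDeviceSpec], actual_state: Dict[int, bool]
) -> List[VirtualDeviceSpec]:
    """Fill in Connectable from the vNIC's real state where it was left unset."""
    return [
        spec
        if spec.connectable is not None
        else VirtualDeviceSpec(spec.device_key, Connectable(actual_state[spec.device_key]))
        for spec in specs
    ]


if __name__ == "__main__":
    specs = [VirtualDeviceSpec(device_key=4000)]  # Connectable omitted
    actual = {4000: True}                         # the vNIC is really connected
    print(host_view_of_vnic(specs[0]))            # read as disconnected
    fixed = populate_connectable(specs, actual)
    print(host_view_of_vnic(fixed[0]))            # matches the real state
```

In this model the mismatch only appears when the real vNIC state is connected and the client omits the object, which matches the failure pattern described for automated migrations.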

Resolution

Follow the steps below to resolve this issue:
  • Apply HotFix: Apply HotFix 8 (HF8) for Aria Operations (patchId=16006). This hotfix changes Aria Operations so that it correctly populates the Connectable object and its connected status during vMotion API calls, in line with the ESXi 8.0 Update 1 datapath requirements.

  • Adjust Timeout Configuration: If migrations continue to fail for large VMs due to timeouts, increase the global timeout settings for Workload Placement actions via the Swagger UI API:
    • Access the Aria Operations Swagger UI.
    • Locate the Workload Placement configuration endpoints.
    • Update the following settings to 1440 minutes (24 hours):
      • PER_RESOURCE_ACTION_TIME_OUT
      • PER_ACTION_TIME_OUT
    • Verify the changes in Administration -> Control Panel -> Audit -> User Activity Audit.
Note: Exclusion logic for extremely large VMs (e.g., 8 TB) has been proposed for future releases.
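
To judge whether a given VM is likely to need the higher timeout, a rough back-of-the-envelope estimate can help. The sketch below is only an approximation under stated assumptions: it treats migration time as data size divided by a sustained throughput, ignores vMotion pre-copy iterations and contention, and the throughput figure is a placeholder to replace with a value measured in your environment. The function names are hypothetical. The 60, 90, and 1440 minute values come from this article.

```python
# Timeout values taken from this article (minutes).
DEFAULT_SINGLE_ACTION_TIMEOUT = 60
DEFAULT_BULK_ACTION_TIMEOUT = 90
RECOMMENDED_TIMEOUT = 1440


def estimated_minutes(size_tb: float, throughput_mb_per_s: float) -> float:
    """Crude linear estimate: data size divided by sustained throughput."""
    size_mb = size_tb * 1024 * 1024
    return size_mb / throughput_mb_per_s / 60


def fits_within(size_tb: float, throughput_mb_per_s: float, timeout_minutes: int) -> bool:
    """True if the estimated duration does not exceed the given timeout."""
    return estimated_minutes(size_tb, throughput_mb_per_s) <= timeout_minutes


if __name__ == "__main__":
    # Assumption: 1000 MB/s sustained; substitute a measured value.
    size_tb, throughput = 8, 1000
    print(f"estimated: {estimated_minutes(size_tb, throughput):.0f} minutes")
    for timeout in (DEFAULT_SINGLE_ACTION_TIMEOUT,
                    DEFAULT_BULK_ACTION_TIMEOUT,
                    RECOMMENDED_TIMEOUT):
        print(timeout, fits_within(size_tb, throughput, timeout))
```

With the assumed 1000 MB/s, an 8 TB VM comes out at roughly 140 minutes, which exceeds both default timeouts but fits comfortably within 1440 minutes.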

Additional Information

Impact: These failures are critical because they affect the VCF 5.2 upgrade cycle and the ability to migrate VMs, and vCenter Server 7.0.3 is end-of-support.