HCX cold migration failing with error "Error is A general system error occurred: Failed to send VCC_COMPLETE to destination. Total progress % is 'null'."

Products

VMware HCX

Issue/Introduction

HCX cold migrations start but eventually fail with the error "A general system error occurred: Failed to send VCC_COMPLETE to destination. Total progress % is 'null'."

The following ERROR message is observed in the HCX Manager log /common/logs/admin/app.log:

<Timestamp> [VmotionService_SvcThread-11828, Ent: HybridityAdmin, , TxId:########-####-####-####-############] ERROR c.v.h.s.v.j.MonitorSourceSideProgressWorkflow- [migId=########-####-####-####-############]] Source side relocate 'task-24291174' failed for the virtual machine. Error is A general system error occurred: Failed to send VCC_COMPLETE to destination. Total progress % is 'null'.

The following messages are observed in the vCenter log /var/log/vmware/vpxd/vpxd.log:

<Timestamp> info vpxd[20897] [Originator@6876 sub=vmomi.soapStub[5453] opID=TaskLoop-host-50] SOAP request returned HTTP failure; <<cs p:00007fe1302199c0, PIPE:/var/run/envoy-hgw/hgw-pipe>, /hgw/host-50/vpxa>, method: waitForUpdatesEx; code: 504(Gateway Timeout); fault: (null)
<Timestamp> error vpxd[20897] [Originator@6876 sub=Vmomi opID=TaskLoop-host-50] Got vmacore exception when invoking VMOMI method; <</hgw/host-50>, /vpxa>, vmodl.query.PropertyCollector.waitForUpdatesEx, N7Vmacore4Http13HttpExceptionE(HTTP error response: Gateway Timeout)
--> [context]###################
########
########
[/context]
<Timestamp> info vpxd[20897] [Originator@6876 sub=TaskInfo opID=TaskLoop-host-50] WaitForUpdates failed; e: N5Vmomi5Fault17HostCommunication9ExceptionE(Fault cause: vmodl.fault.HostCommunication
--> )
--> [context]################=[/context]
<Timestamp> warning vpxd[20897] [Originator@6876 sub=VpxProfiler opID=TaskLoop-host-50] InvokeWithOpId [TotalTime] took 316526 ms
<Timestamp> info vpxd[20910] [Originator@6876 sub=vpxTaskInfo opID=54737-TxId:########-####-####-####-############-d7-01-TaskLoop-host-50] Task vim.Task:task-8655 disconnect with fault [N5Vmomi5Fault17HostCommunicationE]
<Timestamp> error vpxd[20905] [Originator@6876 sub=VmProv opID=54737-TxId:########-####-####-####-############-d7-01] Failed to track task vim.Task:task-8655 on host vim.HostSystem:host-50: Fault cause: vmodl.fault.HostCommunication
--> 
--> backtrace:
--> [backtrace begin] product: VMware VirtualCenter, version: 9.0.0, build: build-24755230, tag: vpxd, cpu: x86_64, os: linux, buildType: release
--> backtrace[00] libvmacore.so[0x004837CB]
--> backtrace[01] libvmacore.so[0x003730AC]: Vmacore::System::Stacktrace::CaptureWork(unsigned int)
--> backtrace[02] libvmacore.so[0x0038550F]: Vmacore::System::SystemFactory::CreateQuickBacktrace(Vmacore::Ref<Vmacore::System::Backtrace>&)
--> backtrace[03] libvmomi.so[0x00179985]
--> backtrace[04] libvmomi.so[0x00270E30]: Vmomi::Fault::HostCommunication::ThrowInternal()

The following message is also observed in the vCenter log /var/log/vmware/vpxd/vpxd.log:

<Timestamp> warning vpxd[4049275] [Originator@6876 sub=VpxProfiler opID=TaskLoop-host-2752627] InvokeWithOpId [TotalTime] took 31417 ms
<Timestamp> error vpxd[2983820] [Originator@6876 sub=VmProv opID=########-####-####-####-############-f4-01] Get exception while executing action vpx.vmprov.CreateDestinationVm:
--> (vmodl.fault.SystemError) {
--> reason = "Failed to send VCC_COMPLETE to destination",
--> msg = "fault.SystemError.summary",
--> }
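To check whether a given log contains the failure signature quoted above, a search along the following lines may help. This is a minimal sketch: the function name find_vcc_complete_failures is hypothetical, and it assumes a POSIX shell with grep is available on the appliance where the log resides.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX or vCenter): print every line of a log
# file that contains the failure signature, prefixed with its line number.
# $1: path to the log file to search.
find_vcc_complete_failures() {
    grep -n "Failed to send VCC_COMPLETE to destination" "$1"
}
```

For example, run "find_vcc_complete_failures /common/logs/admin/app.log" on the HCX Manager, or "find_vcc_complete_failures /var/log/vmware/vpxd/vpxd.log" on vCenter. No output means the signature was not found in that file.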

Cause

Due to a known issue, the property collector connection from vpxd (on vCenter) to vpxa on the Mobility Agent (MA) sometimes fails with a timeout. The NFC copy completes successfully, but vCenter fails to retrieve the task update from the host.

Resolution

This is a known issue impacting VMware HCX.

Workaround:

Disable timeouts for the /vpxa API in the envoy reverse proxy service on both the source and target IX appliances:

Log in to the HCX Manager appliance over SSH, then open an SSH session to the IX appliance with the following commands:
- # ccli
- # list
- # go 0 (assuming that 0 is the IX appliance)
- # ssh
Back up the envoy configuration file with the following command:
- # cp /etc/vmware/envoy.yaml /etc/vmware/envoy.yaml.bak
Open the envoy.yaml file in the vi editor:
- # vi /etc/vmware/envoy.yaml
Add "idle_timeout: 0s" to the route for /vpxa, directly below the existing "timeout: 0s" line, as in the example below:
- Before:
  routes:
  - match:
      path_separated_prefix: "/vpxa"
    route:
      cluster: "vpxa-cluster"
      timeout: 0s
- After:
  routes:
  - match:
      path_separated_prefix: "/vpxa"
    route:
      cluster: "vpxa-cluster"
      timeout: 0s
      idle_timeout: 0s
- NOTE: Indent with spaces, not tabs, and align the new line with the existing "timeout: 0s" line. Incorrect indentation causes the envoy service to fail on restart.
Restart the envoy reverse proxy service for the changes to take effect:
- # systemctl restart envoy
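Because a tab character or a missing line in envoy.yaml can prevent the service from starting, the edited file can be sanity-checked with something like the sketch below, ideally before the restart. The function name check_envoy_yaml is hypothetical, and the check is deliberately coarse: it only confirms that the file contains no tab characters and that the string "idle_timeout: 0s" appears somewhere in it. It does not verify that the line sits under the /vpxa route.

```shell
#!/bin/sh
# Hypothetical helper (not part of HCX): coarse sanity check of an edited
# envoy.yaml. Returns 0 if the file has no tab characters and contains
# "idle_timeout: 0s"; returns 1 otherwise.
# $1: path to the file, e.g. /etc/vmware/envoy.yaml
check_envoy_yaml() {
    file="$1"
    tab=$(printf '\t')

    # List any lines containing a tab character, with line numbers.
    if grep -n "$tab" "$file"; then
        echo "tab characters found in $file" >&2
        return 1
    fi

    # Confirm the new setting is present somewhere in the file.
    if ! grep -q 'idle_timeout: 0s' "$file"; then
        echo "idle_timeout: 0s not found in $file" >&2
        return 1
    fi

    return 0
}
```

For example, run "check_envoy_yaml /etc/vmware/envoy.yaml". Running "diff /etc/vmware/envoy.yaml.bak /etc/vmware/envoy.yaml" should show only the added line, and "systemctl status envoy" shows whether the service came up after the restart. If it did not, copying the .bak file back over envoy.yaml and restarting envoy should restore the previous configuration.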

If you have issues with the workaround, please open a Broadcom GS support case to get assistance.