HCX Bulk Migrations fail with "Initial sync failed" due to HBR service session exhaustion

Products

VMware HCX

Issue/Introduction

Bulk migrations were manually cancelled due to storage constraints on the destination datastore.
Subsequent migration attempts failed with an "Initial sync failed" error.
Insufficient storage space at the target site caused NFC errors between the IX-R appliance and target hosts.

/var/log/vmware/hbrsrv.log

<time stamp> info hbrsrv[19895] [Originator@6876 sub=StorageManager groupID=VRID-####-####-####-####-cf47c0748b4c opID=hsl-27d4b1a7] Destroying NFC connection to host-####.
<time stamp> error hbrsrv[19895] [Originator@6876 sub=Main groupID=VRID-####-####-####-####-cf47c0748b4c opID=hsl-27d4b1a7] [1] NFC error: NFC_DISKLIB_ERROR
<time stamp> error hbrsrv[19895] [Originator@6876 sub=Main groupID=VRID-####-####-####-####-cf47c0748b4c opID=hsl-27d4b1a7] [2] Code set to: Generic storage error.

<time stamp> info hbrsrv[13272] [Originator@6876 sub=StorageManager groupID=VRID-####-####-####-####-8d7da3880fa8 opID=hsl-7328ad1c] Destroying NFC connection to host-999235.
<time stamp> error hbrsrv[13272] [Originator@6876 sub=Main groupID=VRID-####-####-####-####-8d7da3880fa8 opID=hsl-7328ad1c] [1] NFC error: NFC_NO_DISKSPACE
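To gauge how often these NFC failures occur, the log can be scanned for the error codes. The following is a minimal sketch, not a supported tool: the function name count_nfc_errors is hypothetical, and the pattern is derived only from the log lines shown above, so other hbrsrv.log formats may not match.

```python
import re
from collections import Counter

# Pattern derived from the excerpt above (assumption: the line format is stable).
NFC_ERROR_RE = re.compile(r"NFC error: (\w+)")


def count_nfc_errors(lines):
    """Count NFC error codes (for example NFC_NO_DISKSPACE) in hbrsrv.log lines."""
    counts = Counter()
    for line in lines:
        match = NFC_ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, passing an open file handle for a copy of /var/log/vmware/hbrsrv.log returns a Counter keyed by error code.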

Eventually, the HBR service (hbrsrv) on the IX-R appliance locked up, and the following error was logged repeatedly in /var/log/vmware/hbrsrv.log:

<time stamp> error hbrsrv[20633] [Originator@6876 sub=HTTP session map] Out of HTTP sessions: Limited to 500
<time stamp> error hbrsrv[20633] [Originator@6876 sub=SessionCounter] Count of sessions per user:
--> root: 270
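A quick way to confirm this condition is to search the log for the session-limit message and the per-user counts that follow it. The sketch below is illustrative only: the function name summarize_session_exhaustion is hypothetical, and the patterns assume the exact message layout shown in the excerpt above.

```python
import re

# Patterns derived from the excerpt above (assumption: the layout is stable).
LIMIT_RE = re.compile(r"Out of HTTP sessions: Limited to (\d+)")
USER_RE = re.compile(r"^-->\s*(\S+):\s*(\d+)\s*$")


def summarize_session_exhaustion(lines):
    """Return (hits, limit, per_user) for session-exhaustion entries.

    hits     -- number of "Out of HTTP sessions" messages seen
    limit    -- session limit from the most recent message, or None
    per_user -- most recent session count reported for each user
    """
    hits, limit, per_user = 0, None, {}
    in_counts = False  # True while reading "-->" lines under the per-user header
    for raw in lines:
        line = raw.strip()
        limit_match = LIMIT_RE.search(line)
        if limit_match:
            hits += 1
            limit = int(limit_match.group(1))
            in_counts = False
        elif "Count of sessions per user:" in line:
            in_counts = True
        elif in_counts and line.startswith("-->"):
            user_match = USER_RE.match(line)
            if user_match:
                per_user[user_match.group(1)] = int(user_match.group(2))
        else:
            in_counts = False
    return hits, limit, per_user
```

A non-zero hit count indicates that the HBR service has reached its HTTP session limit.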

Any new bulk migration failed before reaching the initial data transfer stage, and the following error was logged in /common/logs/admin/app.log on the target HCX Manager:

<time stamp> [ReplicationTransferService_SvcThread-74210, Ent: HybridityAdmin, , TxId: TxId: ####-####-####-####] ERROR c.v.h.s.r.jobs.SetupTarget- Job (####-####-####-####-dc2aa4a02a92) failed with exception
java.lang.RuntimeException: Error Running request
<snip>
Caused by: com.vmware.vchs.hybridity.adapters.hbr.fault.HbrServiceNotReachableException: Error communicating to HBR server <IX-IP>:8123. Reason: Service Unavailable
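To identify which IX appliance the HCX Manager could not reach, the HBR server address can be extracted from these exceptions. This is a rough sketch under stated assumptions: the function name find_unreachable_hbr_servers is hypothetical, and the pattern relies only on the message text shown above.

```python
import re
from collections import Counter

# Pattern derived from the app.log excerpt above (assumption: wording is stable).
HBR_UNREACHABLE_RE = re.compile(
    r"HbrServiceNotReachableException: "
    r"Error communicating to HBR server (\S+?):(\d+)\. Reason: (.+)$"
)


def find_unreachable_hbr_servers(lines):
    """Count HbrServiceNotReachableException entries per (host, port, reason)."""
    counts = Counter()
    for line in lines:
        match = HBR_UNREACHABLE_RE.search(line.rstrip())
        if match:
            host, port, reason = match.groups()
            counts[(host, int(port), reason)] += 1
    return counts
```

The host value in each key is the address of the IX appliance whose HBR service rejected the request.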

Environment

VMware HCX

Cause

The manual cancellation of bulk migrations due to destination storage constraints led to unstable storage conditions and Network File Copy (NFC) errors. These repeated storage failures prevented sessions from closing correctly, eventually causing the HBR service on the IX-R appliance to exceed its limit of 500 HTTP sessions.

Once this limit was reached, the HBR service became unreachable, resulting in the HbrServiceNotReachableException and Service Unavailable errors during subsequent migration attempts.

Resolution

Before starting a migration, and while one is active, ensure the target datastore has sufficient free space and remains stable enough to handle migration traffic.
Because the HBR service has exhausted its HTTP sessions, restart the hbrsrv service on the destination IX-R appliance to clear them.

1. SSH to the HCX Manager appliance as the admin user.
2. Enter the Central CLI by running ccli.
3. Run list to display the available appliances and note the ID of the target IX-R appliance.
4. Connect to that IX-R appliance by running go <appliance_id>.
5. Restart the service by running systemctl restart hbrsrv.

Note: Restart this service only on the IX-R appliance at the target site of the migration. Which site is the target depends on whether the migration is forward or reverse.

Additional Information

Note: Restarting the dataplane-only IX-R appliance does not impact virtual machines in the Waiting Switchover state. The HCX Manager tracks these states independently. Consequently, it is not necessary to restart the HCX Manager virtual machine.