HCX Network Extension HA Group stuck in MAINTENANCE state after DNS outage

Products

VMware HCX

Issue/Introduction

The HCX Network Extension (NE) High Availability (HA) group status displays as ‘MAINTENANCE’ in the HCX Manager UI.
Individual NE appliance tunnels are UP and passing traffic.
HA roles (Active/Standby) are correctly negotiated. Running the CCLI command 'show ha status' on each appliance shows HA-State: READY with an assigned role (Role: ACTIVE or Role: STANDBY).

#show ha status
HA-State: READY
Role: ACTIVE
Local-ID: [NE Appliance ID]

The issue typically manifests following an environment-wide disruption, such as a DNS service outage.
The HA agent is healthy: running the command 'show service ha-agent' on the NE appliance shows the service as active (running).

#show service ha-agent
ha-agent.service - HCX NE Appliance HA Agent
Loaded: loaded (/etc/systemd/system/ha-agent.service; disabled; preset: enabled)
Active: active (running) since Fri 2026-05-01 06:29:36 UTC; 5 days ago
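
When comparing both appliances, it can help to check the 'show ha status' output programmatically rather than by eye. The following is a minimal sketch, assuming the output has been captured as text (for example, copied from each CCLI session) and keeps the 'Key: value' layout shown above. The function names and the STANDBY sample are illustrative assumptions, not part of HCX.

```python
def parse_ha_status(output):
    """Parse 'Key: value' lines from captured 'show ha status' output into a dict."""
    fields = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the echoed command line (e.g. '#show ha status').
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def ha_pair_is_healthy(first_output, second_output):
    """True when both appliances report READY and the roles are one ACTIVE plus one STANDBY."""
    first = parse_ha_status(first_output)
    second = parse_ha_status(second_output)
    both_ready = first.get("HA-State") == "READY" and second.get("HA-State") == "READY"
    roles = {first.get("Role"), second.get("Role")}
    return both_ready and roles == {"ACTIVE", "STANDBY"}


if __name__ == "__main__":
    # Sample text: the ACTIVE output mirrors the article; the STANDBY output is assumed.
    active_output = "#show ha status\nHA-State: READY\nRole: ACTIVE\nLocal-ID: [NE Appliance ID]"
    standby_output = "#show ha status\nHA-State: READY\nRole: STANDBY\nLocal-ID: [NE Appliance ID]"
    print(ha_pair_is_healthy(active_output, standby_output))
```

If this check passes while the HCX Manager UI still shows MAINTENANCE, the symptoms match the condition described in this article.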

Environment

VMware HCX

Cause

The HA group did not automatically resynchronize its overall health status with the HCX Manager after the network disruption ended. Although the individual appliances re-established communication and negotiated roles, the HCX Manager database still holds the MAINTENANCE or SUSPENDED state that it set to prevent configuration corruption while connectivity was unstable. That state is now stale.

Resolution

[1.] Verify Connectivity

Verify that both the HCX Connector and the HCX Cloud Manager can consistently resolve the FQDN and reach the management IP of their respective vCenter Servers.
If the DNS server is located on an extended network segment that was impacted, consider adding temporary static host entries on the HCX Manager to ensure name resolution during recovery.
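
"Consistently" here means that repeated lookups succeed and return the same address. The following is a minimal sketch of such a check using only the Python standard library. It can be run from any host that uses the same DNS servers as the HCX Manager. The function names, the attempt count, and the use of TCP port 443 for vCenter are assumptions made for illustration.

```python
import socket
import time


def resolve_repeatedly(fqdn, attempts=5, delay=1.0, resolver=socket.gethostbyname):
    """Resolve fqdn several times; return one entry per attempt (None on failure)."""
    results = []
    for attempt in range(attempts):
        try:
            results.append(resolver(fqdn))
        except OSError:  # socket.gaierror is a subclass of OSError
            results.append(None)
        if attempt < attempts - 1:
            time.sleep(delay)
    return results


def resolution_is_consistent(results):
    """True when every attempt succeeded and all attempts returned the same address."""
    return bool(results) and None not in results and len(set(results)) == 1


def port_is_reachable(host, port=443, timeout=3.0):
    """Attempt a TCP connection to host:port; port 443 is assumed for vCenter HTTPS."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A typical use would be to call resolve_repeatedly() with the vCenter FQDN, pass the result to resolution_is_consistent(), and then call port_is_reachable() with the management IP. Intermittent None entries suggest that DNS has not fully recovered.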

[2.] Synchronize HA State (UI)

To clear the stale state without redeploying appliances:

OPTION A - Attempt to force the HCX Manager to recognize the actual state of the appliances:
Navigate to Interconnect > Service Mesh > View Appliances.
Go to the HA Management tab.
Select FORCE SYNC to synchronize the state.

OPTION B - Perform a "Recover" Operation:
The RECOVER action attempts to re-initialize the HA group and synchronize the environment.

Navigate to Interconnect > Service Mesh > View Appliances.
Select the HA Management tab.
Click RECOVER. This triggers a synchronization between the actual appliance state and the HCX Manager database.

Note: If this process requires redeploying appliances, there will be a brief loss of connectivity on the extended network while the new appliances initialize.

[3.] Service Restart (If UI buttons are unavailable)

If the FORCE SYNC and RECOVER buttons are grayed out, restarting the management service can trigger a re-validation of the HA state:
SSH into the HCX Manager as admin and switch to root.
Run: systemctl restart app-engine
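
After the restart, it is worth confirming that the service has come back before rechecking the HA Management tab. The following is a minimal sketch that polls 'systemctl is-active' from a root shell on the HCX Manager, assuming Python 3 is available there. The function names, the 300-second timeout, and the 5-second interval are illustrative assumptions. An active unit only means the process is running, so the UI may need additional time to become responsive.

```python
import subprocess
import time


def unit_is_active(unit="app-engine"):
    """Return True when 'systemctl is-active --quiet <unit>' exits with status 0."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


def wait_until_active(probe=unit_is_active, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll probe() until it returns True or until timeout seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Calling wait_until_active() with its defaults checks the app-engine unit named in the step above. The probe and sleep parameters exist only so that the polling logic can be exercised without systemd.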

[4.] Redeploy Appliances or Contact Broadcom Support

If the state remains stuck after a service restart, there are two options:

Redeploy the HA Group:
Redeploy the Network Extension appliances via the Service Mesh > View Appliances menu.
Warning: This operation causes a brief loss of connectivity on the extended network data path while the new appliances are initialized.

OR

Contact Support: If redeployment is not feasible due to production constraints, or if neither Recover nor Redeploy returns the HA group to a normal state, gather the HCX support bundle from both the source and destination sites, along with the database. Then open a Support Request with Broadcom for manual database correction.

Additional Information

Managing Network Extension High Availability