In VMware HCX environments, the site pairing between an on-premises HCX Connector and a cloud-side HCX Manager (such as Azure VMware Solution, Google Cloud VMware Engine, or VMware Cloud on AWS) may show as down, become unstable, or fail to allow edits to the site pairing or service mesh configuration.
One or more of the following symptoms may be observed:
Reviewing the on-premises HCX Connector logs reveals the following evidence. Run the following commands from the root of an extracted HCX Connector support bundle to identify the issue.
In common/logs/admin/app*.log, Kafka reports the EndpointLinkJob topic (responsible for site pairing operations) as unavailable:
grep -i "EndpointLinkJob.*LEADER_NOT_AVAILABLE\|EndpointLinkJob.*unknown topic\|EndpointLinkJob.*partition error" common/logs/admin/app*.log
Expected output when this issue is present:
[EndpointLinkService_EventListener] WARN o.apache.kafka.clients.NetworkClient-
[Consumer clientId=consumer-EndpointLinkJob-0-EndpointLinkService-##,
groupId=EndpointLinkJob-0-EndpointLinkService] Error while fetching metadata
with correlation id ######: {EndpointLinkJob=LEADER_NOT_AVAILABLE}
[EndpointLinkService_EventListener] WARN o.a.k.c.consumer.internals.Fetcher- [Consumer clientId=consumer-EndpointLinkJob-0-EndpointLinkService-##, groupId=EndpointLinkJob-0-EndpointLinkService] Received unknown topic or partition error in fetch for partition EndpointLinkJob-0
In common/logs/postgres/postgresql-*.log, Postgres reports disk exhaustion errors:
grep -i "no space left\|recovery mode\|FATAL" common/logs/postgres/postgresql-*.log
Expected output when this issue is present:
ERROR: could not extend file "base/16384/######": No space left on device
LOG: could not write temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
FATAL: the database system is in recovery mode
The app-perf*.log files contain periodic snapshots of disk utilization:
grep "disk.free.*common\|disk.total.*common" common/logs/admin/app-perf*.log
Expected output when this issue is present shows low or zero free space:
disk.free{path=/common} value=0 GiB
disk.total{path=/common} value=43.393124 GiB
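The two metrics above can be combined into a single percentage. The following is a minimal sketch, assuming the app-perf lines match the sample format shown (a value= field followed by GiB); the function name common_free_pct is hypothetical and is not part of HCX.

```shell
#!/bin/sh
# Report the most recent /common snapshot found in app-perf logs as percent free.
# Assumption: lines look like the samples above, e.g.
#   disk.free{path=/common} value=0 GiB
#   disk.total{path=/common} value=43.393124 GiB
common_free_pct() {
    # $@: one or more app-perf log files, oldest first
    awk '
        index($0, "disk.free{path=/common}")  { sub(/.*value=/, ""); free = $1 + 0; seen = 1 }
        index($0, "disk.total{path=/common}") { sub(/.*value=/, ""); total = $1 + 0 }
        END {
            # Later lines overwrite earlier ones, so the last snapshot wins.
            if (!seen || total <= 0) { print "no /common snapshot found"; exit 1 }
            printf "%.1f%% free (%.2f of %.2f GiB)\n", 100 * free / total, free, total
        }' "$@"
}
```

From the root of the extracted bundle, this could be called as: common_free_pct common/logs/admin/app-perf*.log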
Queries against the Postgres Job table time out due to table bloat:
grep -i "canceling statement due to statement timeout" common/logs/postgres/postgresql-*.log
Expected output when this issue is present:
ERROR: canceling statement due to statement timeout STATEMENT: SELECT * FROM "metrics_GetJobStats" ($1)
ERROR: canceling statement due to statement timeout STATEMENT: SELECT * FROM "workflowManagement_checkIfJobIsCancelled" ($1)
The SITE_PAIR_VERSION_CHECK job runs periodically. An "Error queuing Job" entry indicates the site pair check failed during the outage:
grep "SITE_PAIR_VERSION_CHECK" common/logs/admin/job*.log
Expected output during the issue shows a WARN entry with "Error queuing Job":
WARN c.v.v.h.messaging.adapter.JobManager- Error queuing Job: Workflow EndpointLinkJob/SITE_PAIR_VERSION_CHECK job (########-####-####-####-############) State:BEGIN PrevState:UNDEFINED_STATE called by Service:UNDEFINED_SERVICE
After recovery, the same grep shows the jobs completing normally with COMPLETE JOB and State:DELETE_ALERT.
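The individual checks above can be run in one pass for a quick overview. This is a sketch only: it reuses the grep patterns from this article, the function names and the output labels (such as kafka_topic_unavailable) are hypothetical, and a count of zero does not rule out other causes.

```shell
#!/bin/sh
# Count occurrences of each failure signature in an extracted support bundle.
count_matches() {
    # $1: extended regex; remaining args: log files (missing files are ignored)
    pattern=$1; shift
    cat "$@" 2>/dev/null | grep -Eic -- "$pattern"
}

triage_bundle() {
    # $1: root of the extracted HCX Connector support bundle (default: current directory)
    root=${1:-.}
    printf 'kafka_topic_unavailable=%s\n' "$(count_matches \
        'EndpointLinkJob.*(LEADER_NOT_AVAILABLE|unknown topic|partition error)' \
        "$root"/common/logs/admin/app*.log)"
    printf 'postgres_disk_errors=%s\n' "$(count_matches \
        'no space left|recovery mode|FATAL' \
        "$root"/common/logs/postgres/postgresql-*.log)"
    printf 'statement_timeouts=%s\n' "$(count_matches \
        'canceling statement due to statement timeout' \
        "$root"/common/logs/postgres/postgresql-*.log)"
    printf 'site_pair_queue_errors=%s\n' "$(count_matches \
        'Error queuing Job.*SITE_PAIR_VERSION_CHECK' \
        "$root"/common/logs/admin/job*.log)"
}
```

Running triage_bundle . from the bundle root prints one label=count line per signature; non-zero counts across several lines are consistent with the failure pattern described in this article.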
The /common partition on the on-premises HCX Connector becomes full due to Postgres Job table bloat. A regression in HCX 4.11.0 and 4.11.1 results in missing RPMs necessary for the Postgres vacuum functionality to operate correctly, causing the Job table to grow excessively over time. This is documented in KB 429452.
When the /common partition reaches full capacity, the following cascade of failures occurs:
/common partition as well. When disk space runs out, Kafka can no longer serve the EndpointLinkJob partition, which is the Kafka topic responsible for site pairing operations.
The cloud-side HCX Manager may continue to show the site pairing as down even after the on-premises issue is partially resolved, because the cloud-side UI state can become stale and may not immediately reflect the restored connectivity.
To resolve this issue, extend the /common partition and vacuum the bloated Job table.
Follow the procedures in the following Knowledge Base articles to add disk space to the HCX Manager:
Follow the database maintenance procedure in KB 429452 to vacuum the bloated Job table:
For HCX 4.11.3 and later:
systemctl stop mmd
systemctl stop hcmProbe
systemctl stop appliance-management
systemctl stop web-engine
systemctl stop plan-engine
systemctl stop app-engine
systemctl stop kafka
systemctl stop zookeeper
For HCX 4.11.2 and earlier:
systemctl stop zookeeper
systemctl stop kafka
systemctl stop app-engine
systemctl stop web-engine
systemctl stop appliance-management
psql -U postgres hybridity
VACUUM FULL "Job";
df -h /common
For HCX 4.11.3 and later:
systemctl start postgresdb
systemctl start zookeeper
systemctl start kafka
systemctl start app-engine
systemctl start plan-engine
systemctl start web-engine
systemctl start appliance-management
systemctl start mmd
systemctl start hcmProbe
For HCX 4.11.2 and earlier:
systemctl start postgresdb
systemctl start zookeeper
systemctl start kafka
systemctl start app-engine
systemctl start web-engine
systemctl start appliance-management
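Because the service order matters, the stop and start sequences can be wrapped in a small helper that halts at the first failure instead of continuing. The sketch below is illustrative only: services_in_order and the SYSTEMCTL override are hypothetical names, and the service list passed in must be the one given above for the installed HCX version.

```shell
#!/bin/sh
# Apply one systemctl action to services in the order given, stopping at the first failure.
# SYSTEMCTL is a hypothetical override; set it to "echo" to preview the sequence.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

services_in_order() {
    # $1: action (stop or start); remaining args: service names in the required order
    action=$1; shift
    for svc in "$@"; do
        $SYSTEMCTL "$action" "$svc" || {
            echo "failed: $action $svc" >&2
            return 1
        }
    done
}
```

For example, with SYSTEMCTL=echo set, services_in_order stop zookeeper kafka app-engine web-engine appliance-management prints the sequence for HCX 4.11.2 and earlier without changing any service state.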
After completing the above steps, confirm recovery by running the same grep commands from the Issue/Introduction section against a fresh support bundle. Verify the following:
/common partition has sufficient free space:
df -h /common
grep -i "EndpointLinkJob.*LEADER_NOT_AVAILABLE" common/logs/admin/app*.log
This command should return no results from timestamps after the recovery.
grep -i "no space left" common/logs/postgres/postgresql-*.log
This command should return no results from timestamps after the recovery.
grep "SITE_PAIR_VERSION_CHECK" common/logs/admin/job*.log | tail -10
Results should show COMPLETE JOB with State:DELETE_ALERT at regular intervals (approximately every 8 hours).
app-perf metrics confirm healthy disk utilization and reduced Job table response times:
grep "disk.free.*common" common/logs/admin/app-perf*.log | tail -1
grep "workflowManagement_checkIfJobIsCancelled" common/logs/admin/app-perf*.log | tail -1
The disk.free value should show significant free space, and the checkIfJobIsCancelled mean response time should be under 1 millisecond (compared to multi-second timeouts when bloated).
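On the live appliance, the free-space check can also be scripted so that it fails loudly if /common fills up again. This is a minimal sketch: the function names are hypothetical, and the 80 percent default threshold is an assumption chosen for illustration, not a value from this article or from KB 429452.

```shell
#!/bin/sh
# Warn when a filesystem's used capacity reaches a threshold.
used_pct() {
    # df -P prints one line per filesystem; field 5 is the capacity, e.g. "87%"
    df -P "${1:-/common}" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_free() {
    # $1: mount point (default /common); $2: used-percent threshold (default 80, an assumption)
    path=${1:-/common}
    limit=${2:-80}
    pct=$(used_pct "$path")
    case "$pct" in
        ''|*[!0-9]*) echo "could not read usage for $path" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARN: $path is ${pct}% used (threshold ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${pct}% used"
}
```

Calling check_free with no arguments checks /common against the default threshold; a non-zero exit status makes it easy to use from a scheduled job.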
If the cloud-side site pairing status remains stale after the on-premises recovery is confirmed, a reboot of the cloud-side HCX Manager or re-registration of the site pairing may be required to refresh the UI state.
If the error persists after following these steps, contact Broadcom Support for further assistance.
Collect the following information and upload to the support case:
df -h /common from the affected HCX Manager.