HCX site pairing failure or instability caused by /common partition exhaustion

Products

VMware HCX

Issue/Introduction

In VMware HCX environments, the site pairing between an on-premises HCX Connector and a cloud-side HCX Manager (for example, in Azure VMware Solution, Google Cloud VMware Engine, or VMware Cloud on AWS) may show as down, become unstable, or reject edits to the site pairing or service mesh configuration.

One or more of the following symptoms may be observed:

The site pairing status shows as down from the cloud side while the on-premises side shows it as up.
Editing the site pairing or service mesh returns errors.
Migrations are stuck, fail to cancel, or hang at 0%.
The HCX Manager UI becomes unresponsive or extremely slow.
Interconnect (IX) and Network Extension (NE) appliance tunnels remain up despite the site pairing showing down.

The on-premises HCX Connector logs contain the evidence described below. Run each command from the root of an extracted HCX Connector support bundle.

Check for Kafka EndpointLinkJob failures (site pairing impact)

In common/logs/admin/app*.log, Kafka reports the EndpointLinkJob topic (responsible for site pairing operations) as unavailable:

grep -i "EndpointLinkJob.*LEADER_NOT_AVAILABLE\|EndpointLinkJob.*unknown topic\|EndpointLinkJob.*partition error" common/logs/admin/app*.log

Expected output when this issue is present:

[EndpointLinkService_EventListener] WARN o.apache.kafka.clients.NetworkClient-
[Consumer clientId=consumer-EndpointLinkJob-0-EndpointLinkService-##,
groupId=EndpointLinkJob-0-EndpointLinkService] Error while fetching metadata
with correlation id ######: {EndpointLinkJob=LEADER_NOT_AVAILABLE}

[EndpointLinkService_EventListener] WARN o.a.k.c.consumer.internals.Fetcher-
[Consumer clientId=consumer-EndpointLinkJob-0-EndpointLinkService-##,
groupId=EndpointLinkJob-0-EndpointLinkService] Received unknown topic or
partition error in fetch for partition EndpointLinkJob-0

Check for /common partition disk exhaustion in Postgres

In common/logs/postgres/postgresql-*.log, Postgres reports disk exhaustion errors:

grep -i "no space left\|recovery mode\|FATAL" common/logs/postgres/postgresql-*.log

Expected output when this issue is present:

ERROR: could not extend file "base/16384/######": No space left on device
LOG: could not write temporary statistics file "pg_stat_tmp/db_0.tmp": No space left on device
FATAL: the database system is in recovery mode

Check /common partition utilization from app-perf metrics

The app-perf*.log files contain periodic snapshots of disk utilization:

grep "disk.free.*common\|disk.total.*common" common/logs/admin/app-perf*.log

Expected output when this issue is present shows low or zero free space:

disk.free{path=/common} value=0 GiB
disk.total{path=/common} value=43.393124 GiB
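
The two metric lines can be combined into a single utilization figure. The following is a minimal sketch, assuming the app-perf lines carry the values exactly as in the sample output above (`value=<number> GiB`). The function name `common_usage` is hypothetical and not part of HCX.

```shell
# Hypothetical helper: summarize /common usage from app-perf snapshots.
# ASSUMPTION: metric lines look like the sample, e.g.
#   disk.free{path=/common} value=0 GiB
# The last value seen for each metric wins, so pass the newest file last.
common_usage() {
    awk '
        index($0, "disk.free{path=/common}")  { sub(/.*value=/, ""); free  = $1 + 0; have_free = 1 }
        index($0, "disk.total{path=/common}") { sub(/.*value=/, ""); total = $1 + 0 }
        END {
            if (!have_free || total <= 0) { print "no /common metrics found"; exit 1 }
            printf "free=%.2f GiB total=%.2f GiB used=%.1f%%\n", free, total, (total - free) * 100 / total
        }' "$@"
}

# Example, from the bundle root:
#   common_usage common/logs/admin/app-perf*.log
```

With the sample values shown above, the function prints `free=0.00 GiB total=43.39 GiB used=100.0%`.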

Check for Job table query timeouts (confirms bloat)

Queries against the Postgres Job table time out due to table bloat:

grep -i "canceling statement due to statement timeout" common/logs/postgres/postgresql-*.log

Expected output when this issue is present:

ERROR: canceling statement due to statement timeout
STATEMENT: SELECT * FROM "metrics_GetJobStats" ($1)

ERROR: canceling statement due to statement timeout
STATEMENT: SELECT * FROM "workflowManagement_checkIfJobIsCancelled" ($1)

Check site pairing health check job status

The SITE_PAIR_VERSION_CHECK job runs periodically. An "Error queuing Job" entry indicates the site pair check failed during the outage:

grep "SITE_PAIR_VERSION_CHECK" common/logs/admin/job*.log

Expected output during the issue shows a WARN entry with "Error queuing Job":

WARN c.v.v.h.messaging.adapter.JobManager- Error queuing Job: Workflow
EndpointLinkJob/SITE_PAIR_VERSION_CHECK job (########-####-####-####-############)
State:BEGIN PrevState:UNDEFINED_STATE called by Service:UNDEFINED_SERVICE

After recovery, the same grep shows the jobs completing normally with COMPLETE JOB and State:DELETE_ALERT.
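
The individual checks above can be gathered into one pass over the bundle. This is a sketch only: the function names and output labels are hypothetical, the patterns are the ones used in the grep commands in this section, and the last pattern assumes that "Error queuing Job" and SITE_PAIR_VERSION_CHECK appear on the same log line, as the grep above implies.

```shell
# Hypothetical triage helper: count each failure signature from this article
# in an extracted support bundle.

# count_sig <label> <pattern> <file>...  prints "<label>=<matching line count>"
count_sig() (
    label="$1"; pattern="$2"; shift 2
    # Missing files are ignored; grep -c still prints 0 when nothing matches.
    n=$(cat "$@" 2>/dev/null | grep -ci -e "$pattern")
    echo "$label=$n"
)

# hcx_common_triage [bundle_root]  (defaults to the current directory)
hcx_common_triage() (
    root="${1:-.}"
    count_sig kafka_leader_not_available "EndpointLinkJob.*LEADER_NOT_AVAILABLE" \
        "$root"/common/logs/admin/app*.log
    count_sig postgres_no_space "No space left on device" \
        "$root"/common/logs/postgres/postgresql-*.log
    count_sig postgres_statement_timeout "canceling statement due to statement timeout" \
        "$root"/common/logs/postgres/postgresql-*.log
    count_sig site_pair_check_queue_error "Error queuing Job.*SITE_PAIR_VERSION_CHECK" \
        "$root"/common/logs/admin/job*.log
)
```

Non-zero counts on several lines at once are consistent with the issue described in this article. A single non-zero line on its own is weaker evidence and should be read together with the disk utilization metrics.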

Environment

VMware HCX 4.11.x
Cloud endpoints: Azure VMware Solution (AVS), Google Cloud VMware Engine (GCVE), VMware Cloud on AWS (VMC)

Cause

The /common partition on the on-premises HCX Connector fills up because of Postgres Job table bloat. A regression in HCX 4.11.0 and 4.11.1 omits RPMs that the Postgres vacuum functionality requires, so the Job table grows excessively over time. This is documented in KB 429452.

When the /common partition reaches full capacity, the following cascade of failures occurs:

Postgres failure — The database can no longer write temporary statistics files, extend data files, or process transactions. It eventually enters recovery mode.
Kafka failure — Kafka stores its data on the /common partition as well. When disk space runs out, Kafka can no longer serve the EndpointLinkJob partition, which is the Kafka topic responsible for site pairing operations.
Site pairing failure — With the EndpointLinkJob topic unavailable, the EndpointLinkService cannot process site pairing events, causing the site pair to appear down.
Migration failure — Migrations hang or fail to cancel because Job table queries time out and Postgres cannot process job state changes.

The cloud-side HCX Manager may continue to show the site pairing as down even after the on-premises issue is partially resolved, as the cloud-side UI state can become stale and may not immediately reflect the restored connectivity.

Resolution

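To resolve this issue, first extend the /common partition, then vacuum the bloated Job table. The order matters: VACUUM FULL rewrites the table into a new file and needs free disk space to do so.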

Step 1: Extend the /common partition

Follow the procedures in the following Knowledge Base articles to add disk space to the HCX Manager:

KB 409157 — Steps to Add a New Virtual Disk to an existing HCX Manager.
KB 373238 — Increasing HCX Manager Disk Space for HCX Software Version 4.10 or Higher.
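
Once the disk has been extended, free space on /common can be confirmed before continuing. The following is a minimal sketch. The function name `check_free_space` is hypothetical, and the 20 percent default threshold is an arbitrary illustration, not a documented HCX requirement.

```shell
# Hypothetical pre-check: report free space on a mount point.
# Usage: check_free_space [path] [min_free_percent]
# Returns 0 when free space meets the threshold, 1 when it does not,
# and 2 when the usage figure cannot be read.
check_free_space() {
    path="${1:-/common}"
    min_free_pct="${2:-20}"
    # df -P prints one header line and one data line; column 5 is used capacity.
    used_pct=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    case "$used_pct" in
        ''|*[!0-9]*) echo "could not read usage for $path"; return 2 ;;
    esac
    free_pct=$((100 - used_pct))
    echo "$path: ${free_pct}% free"
    [ "$free_pct" -ge "$min_free_pct" ]
}

# Example, on the HCX Manager:
#   check_free_space /common 20 || echo "extend /common further before vacuuming"
```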

Step 2: Vacuum the Job table

Follow the database maintenance procedure in KB 429452 to vacuum the bloated Job table:

Stop all HCX services except for the Postgres database.

For HCX 4.11.3 and later:

systemctl stop mmd
systemctl stop hcmProbe
systemctl stop appliance-management
systemctl stop web-engine
systemctl stop plan-engine
systemctl stop app-engine
systemctl stop kafka
systemctl stop zookeeper

For HCX 4.11.2 and earlier:

systemctl stop zookeeper
systemctl stop kafka
systemctl stop app-engine
systemctl stop web-engine
systemctl stop appliance-management

Connect to the HCX Manager via SSH as admin, then open a psql session on the hybridity database:
```
psql -U postgres hybridity
```
Execute the vacuum command on the Job table:
```
VACUUM FULL "Job";
```
Exit psql (\q), then verify that disk space has been reclaimed:
```
df -h /common
```
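
To quantify how much space the vacuum reclaimed, the Job table size can be read once before VACUUM FULL and again afterwards. This is a sketch that reuses the psql invocation from the step above; the wrapper name `job_table_size` is hypothetical. `pg_total_relation_size` and `pg_size_pretty` are standard Postgres functions.

```shell
# Hypothetical helper: print the on-disk size of the Job table,
# including its indexes and TOAST data, in human-readable form.
# -A (unaligned) and -t (tuples only) reduce the output to the bare value.
job_table_size() {
    psql -U postgres hybridity -At \
        -c "SELECT pg_size_pretty(pg_total_relation_size('\"Job\"'));"
}

# Example, on the HCX Manager:
#   job_table_size    # run before and after VACUUM FULL and compare
```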

Restart all HCX services in order.

For HCX 4.11.3 and later:

systemctl start postgresdb
systemctl start zookeeper
systemctl start kafka
systemctl start app-engine
systemctl start plan-engine
systemctl start web-engine
systemctl start appliance-management
systemctl start mmd
systemctl start hcmProbe

For HCX 4.11.2 and earlier:

systemctl start postgresdb
systemctl start zookeeper
systemctl start kafka
systemctl start app-engine
systemctl start web-engine
systemctl start appliance-management

Verify the HCX Manager UI is responsive and the site pairing status returns to a healthy state.
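
Before checking the UI, it can help to confirm that every service actually came back up. The following is a minimal sketch using `systemctl is-active`; the function name `check_hcx_services` is hypothetical, and the service names must be taken from the list above that matches the installed HCX version.

```shell
# Hypothetical helper: report any listed service that is not active.
# Returns 0 when all services are active, 1 otherwise.
check_hcx_services() {
    failed=0
    for svc in "$@"; do
        if systemctl is-active --quiet "$svc"; then
            echo "active: $svc"
        else
            echo "NOT active: $svc"
            failed=1
        fi
    done
    return "$failed"
}

# Example for HCX 4.11.3 and later (for 4.11.2 and earlier, omit
# plan-engine, mmd, and hcmProbe):
#   check_hcx_services postgresdb zookeeper kafka app-engine plan-engine \
#       web-engine appliance-management mmd hcmProbe
```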

Step 3: Verify recovery

After completing the above steps, confirm recovery by running the same grep commands from the Issue/Introduction section against a fresh support bundle. Verify the following:

The /common partition has sufficient free space:
```
df -h /common
```
The EndpointLinkJob LEADER_NOT_AVAILABLE errors are no longer present:
```
grep -i "EndpointLinkJob.*LEADER_NOT_AVAILABLE" common/logs/admin/app*.log
```
The command should return no entries with timestamps later than the recovery.
No "No space left on device" errors appear in the Postgres logs:
```
grep -i "no space left" common/logs/postgres/postgresql-*.log
```
The command should return no entries with timestamps later than the recovery.
The SITE_PAIR_VERSION_CHECK jobs are completing successfully:
```
grep "SITE_PAIR_VERSION_CHECK" common/logs/admin/job*.log | tail -10
```
Results should show COMPLETE JOB with State:DELETE_ALERT at regular intervals (approximately every 8 hours).
The app-perf metrics confirm healthy disk utilization and reduced Job table response times:
```
grep "disk.free.*common" common/logs/admin/app-perf*.log | tail -1
grep "workflowManagement_checkIfJobIsCancelled" common/logs/admin/app-perf*.log | tail -1
```
The disk.free value should show significant free space, and the checkIfJobIsCancelled mean response time should be under 1 millisecond (compared to multi-second timeouts when bloated).
The site pairing shows as up from both on-premises and cloud-side HCX Managers.
Migrations can be initiated and canceled as expected.
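
Because a fresh support bundle can still contain log entries from before the recovery, the grep checks above are easier to read when restricted to lines stamped after the recovery time. The sketch below does this under a stated assumption: every log line of interest begins with a `YYYY-MM-DD HH:MM:SS` timestamp. This format is not confirmed by this article, so verify it against the actual logs first. The function name `matches_since` is hypothetical.

```shell
# Hypothetical helper: print matching lines stamped at or after a cutoff.
# Usage: matches_since "YYYY-MM-DD HH:MM:SS" <pattern> <file>...
# ASSUMPTION: lines start with a "YYYY-MM-DD HH:MM:SS" timestamp, so a plain
# string comparison on the first 19 characters orders them by time.
# Lines without a leading timestamp (continuation lines) are skipped.
matches_since() {
    cutoff="$1"; pattern="$2"; shift 2
    grep -hi -e "$pattern" "$@" 2>/dev/null |
        awk -v cutoff="$cutoff" '
            /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9] [0-9]/ &&
            substr($0, 1, 19) >= cutoff'
}

# Example, from the bundle root (the cutoff value is illustrative):
#   matches_since "2025-01-01 00:00:00" "no space left" common/logs/postgres/postgresql-*.log
```

The cutoff must be given in the same time zone as the log timestamps. Empty output indicates that the error has not recurred since the cutoff.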

If the cloud-side site pairing status remains stale after the on-premises recovery is confirmed, a reboot of the cloud-side HCX Manager or re-registration of the site pairing may be required to refresh the UI state.

If the error persists after following these steps, contact Broadcom Support for further assistance.

Collect the following information and upload it to the support case:

Output of df -h /common from the affected HCX Manager.
Source and target HCX log bundles with the HCX database dump option selected.
Screenshots of the site pairing status from both on-premises and cloud-side HCX Managers.

Additional Information

KB 429452 — HCX Manager /common partition full due to Postgres JOB table bloat
KB 373238 — Increasing HCX Manager Disk Space for HCX Software Version 4.10 or Higher
KB 409157 — Steps to Add a New Virtual Disk to an existing HCX Manager
KB 408671 — HCX Manager Migration UI Goes Read-Only: Postgres Wraparound Error "database is not accepting commands to avoid wraparound data loss"
KB 321586 — HCX Connector or Cloud Manager unresponsive due to high utilization of "/common" directory