HCX Connector or Cloud Manager unresponsive due to high utilization of "/common" directory

Products

VMware HCX

Issue/Introduction

HCX Connector or Cloud Manager may experience various issues related to upgrades and disk space management:

Failed upgrade attempts due to insufficient disk space, with the error: Not enough disk space available for upgrade on /tmp/upgrade/download
Unresponsive system due to high utilization of "/common" directory
Blocked upgrade processes from previous failed attempts.
The system becomes inaccessible or unusable during runtime or after an upgrade.
The app-startup.log and web-startup.log files grow excessively large due to log rotation issues

Common symptoms include:

Failed upgrade processes
Grayed-out update buttons in the UI
Migration and configuration workflows failing
Database or messaging services failing to start
The /common partition approaches or reaches 100% usage (start monitoring once usage reaches 60%)
Upgrade process stuck in "Running" state indefinitely
Log files not properly rotating, leading to storage exhaustion
When attempting to use the HCX Plugin via vCenter Server UI, the error below is displayed:
Http failure response for https://<vc-ip/fqdn>/plugins/com.vmware.hcx.plugin~4.10.0.24144741~-374630034/<hcx-ip/fqdn>-443/vsphere-client/ui/hcx/hcx-ui/rest/hybridity/api/sessions: 401 OK

The /common directory is the storage location for all HCX core services, including Zookeeper, Kafka, App/Web engine, Job framework, PostgresDB, the HCX Fleet appliances bundle, the HCX Manager upgrade package file, and the tech support bundle.

For example, to confirm that the /common partition has reached 100%, run:
df -h

Example output:
Filesystem Size Used Avail Use% Mounted on
/dev/sda6 44G 42G 0 100% /common
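The check above can be scripted. The following is a minimal sketch, not a supported tool: the function names usage_percent and classify_usage are hypothetical, and it assumes /common is a separate mount point, as in the example output. It reads the portable `df -P` format and applies the 60% monitoring point and 100% full condition mentioned in this article.

```shell
#!/bin/sh
# Hypothetical helper: print the Use% value (as a bare integer) for the
# mount point given in $1, reading `df -P` style output on stdin.
# In `df -P` output, column 5 is the capacity and column 6 the mount point.
usage_percent() {
    awk -v mp="$1" 'NR > 1 && $6 == mp { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: classify a usage percentage.
# 60 is the monitoring point named in this article; 100 is a full partition.
classify_usage() {
    if [ "$1" -ge 100 ]; then
        echo "full"
    elif [ "$1" -ge 60 ]; then
        echo "monitor"
    else
        echo "ok"
    fi
}
```

On an appliance, this could be used as `pct=$(df -P /common | usage_percent /common)` followed by `classify_usage "$pct"`.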

Network Extension Impact:

No impact on existing Network Extension services
Configuration changes are blocked until the issue is resolved
IX/NE tunnels and Site Pairing remain in service

Migration Impact:

Active migrations may be disrupted during maintenance
No configuration or migration workflows will be serviced until the issue is resolved

Cause

Several factors can contribute to these issues:

PostgresDB Issues:

Autovacuum process failing to start at boot up due to a timing condition in versions before 4.3.3
Failed table index cleaning and disk space management
High DB usage exhausting the partition space, particularly with a large, constantly changing vCenter Server inventory or a large number of migration workflows executed in a short period

Storage Issues:

The file system becoming full from previous failed upgrade attempts
Partial upgrade downloads due to disruptions
Failure to extract upgrade bundles due to corruption
Insufficient space in /common/upgrade/*
Old log files not being purged from /common/logs/admin/*
Log files not being purged from /common/logs/upgrade/*
Large log files accumulating in /common/logs/admin/*

Check the following logs in the /common/logs/admin directory:

app-startup.log*
web-startup.log*

Also run the following command to check the size of the other logs in that directory. When following the resolution, remove any that are both large and old.

ls -l
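To surface the largest files first, the listing can be sorted by size. This is a small sketch built on the standard `ls -S` option. The function name largest_logs and the default count of 10 are hypothetical choices, and the directory is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: list the N largest entries in a directory.
#   $1 = directory to inspect (for example /common/logs/admin)
#   $2 = number of entries to show (defaults to 10)
largest_logs() {
    # ls -S sorts by size, largest first; tail -n +2 skips the "total" line.
    ls -lS "$1" | tail -n +2 | head -n "${2:-10}"
}
```

For example, `largest_logs /common/logs/admin 5` would show the five largest entries, which makes oversized app-startup.log* and web-startup.log* files easy to spot.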

Log Rotation Issues:

In HCX 4.5.0, the zookeeper.out logging configuration changed from INFO to DEBUG due to system requirements
This change produced a very high volume of log entries, and zookeeper.out was not being rotated and compressed
The unrotated log can take months to reach its maximum size
The default Kafka broker configuration retained messages related to topics
High volume of unrotated log entries
Messages not being cleared by the underlying utility caused high utilization of the /common directory

Resolution

Short-term Measures:

For upgrade issues due to /tmp, restart the HCX Manager/Connector appliance to clean up the /tmp directory.
For upgrade issues due to a large /common/logs/zookeeper/zookeeper.out file, restart the HCX Manager/Connector appliance. The file is cleaned up on startup.
For additional cleanup, please open a support case with Broadcom Support and refer to this KB article. For more information, see Creating and managing Broadcom support cases.

Long-term Measures:

Upgrade to the latest version (4.10.3+), which contains all previous fixes, including:

PostgresDB/Auto Vacuum issues fixed in version 4.3.3+
Kafka-db retention policies were improved and fixed in version 4.7.0+
Unexpected increase of zookeeper.out log files under "/common/logs/zookeeper" directory improved and fixed in version 4.8.0+
Reduced verbosity of startup logs in version 4.10.2+
Improved log rotation configuration in versions 4.10.3+ and 4.11+
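
The version thresholds above can be checked against a given release with a short script. This is an illustrative sketch only: version_ge and fixes_included are hypothetical helper names, the version string is supplied by the operator, and the comparison assumes purely numeric dotted versions.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 >= version $2.
# Dotted fields are compared numerically, left to right; missing fields are 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# Hypothetical helper: print the fixes from this article that version $1 includes.
fixes_included() {
    v=$1
    if version_ge "$v" 4.3.3;  then echo "PostgresDB autovacuum fix (4.3.3+)"; fi
    if version_ge "$v" 4.7.0;  then echo "Kafka-db retention policy fix (4.7.0+)"; fi
    if version_ge "$v" 4.8.0;  then echo "zookeeper.out growth fix (4.8.0+)"; fi
    if version_ge "$v" 4.10.2; then echo "Reduced startup log verbosity (4.10.2+)"; fi
    if version_ge "$v" 4.10.3; then echo "Improved log rotation configuration (4.10.3+)"; fi
    return 0
}
```

For example, `fixes_included 4.8.0` prints the first three entries, which shows that the startup log and log rotation improvements still require an upgrade.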