HCX Connector or Cloud Manager unresponsive due to high utilization of "/common" directory

Article ID: 321586

Updated On:

Products

VMware HCX

Issue/Introduction

HCX Connector or Cloud Manager may experience various issues related to upgrades and disk space management:

  1. Failed upgrade attempts due to insufficient disk space, reported as "Not enough disk space available for upgrade on /tmp/upgrade/download".
  2. Unresponsive system due to high utilization of "/common" directory
  3. Blocked upgrade processes from previous failed attempts.
  4. The system becomes inaccessible or unusable during runtime or after an upgrade.
  5. The app-startup.log and web-startup.log files grow excessively large due to log rotation issues

Common symptoms include:

  • Failed upgrade processes
  • Grayed-out update buttons in the UI
  • Migration and configuration workflows failing
  • Database or messaging services failing to start
  • The /common partition approaches or reaches 100% usage (start monitoring once usage reaches 60%)
  • Upgrade process stuck in "Running" state indefinitely
  • Log files not properly rotating, leading to storage exhaustion
  • When attempting to use the HCX Plugin via vCenter Server UI, the error below is displayed: 
    Http failure response for https://<vc-ip/fqdn>/plugins/com.vmware.hcx.plugin~4.10.0.24144741~-374630034/<hcx-ip/fqdn>-443/vsphere-client/ui/hcx/hcx-ui/rest/hybridity/api/sessions: 401 OK


The /common directory holds the data for all HCX core services, including Zookeeper, Kafka, the App/Web engine, the Job framework, PostgresDB, the HCX Fleet appliances bundle, the HCX Manager upgrade package file, and the tech support bundle.

For example, to confirm that the /common partition has reached 100% usage, run:
df -h

Example output:
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda6   44G   42G   0      100%  /common
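
The check above can be automated. The following is a minimal sketch, not an official HCX tool: the function names usage_percent and check_usage are hypothetical, and it assumes /common is a separate mount and that df supports the POSIX -P output format. The 60% warning level is taken from the monitoring guidance in this article.

```shell
#!/bin/sh
# Sketch only: usage_percent and check_usage are illustrative names.

# usage_percent reads df-style output on stdin and prints the Use% value
# (without the % sign) for the given mount point.
usage_percent() {
    mount_point="$1"
    awk -v mp="$mount_point" 'NR > 1 && $6 == mp { sub(/%/, "", $5); print $5 }'
}

# check_usage prints FULL at 100% or more, WARN at or above the threshold
# (default 60), OK below it, and UNKNOWN if the value is not a number.
check_usage() {
    percent="$1"
    threshold="${2:-60}"
    case "$percent" in
        ''|*[!0-9]*) echo "UNKNOWN"; return 1 ;;
    esac
    if [ "$percent" -ge 100 ]; then
        echo "FULL"
    elif [ "$percent" -ge "$threshold" ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

# Example on the appliance (assumes /common is its own mount point):
#   check_usage "$(df -P /common | usage_percent /common)"
```

Parsing is kept separate from the threshold check so that the same functions can be run against saved df output from a support bundle.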

Network Extension Impact:

  • No impact on existing Network Extension services
  • Configuration changes are blocked until the issue is resolved
  • IX/NE tunnels and Site Pairing remain in service

Migration Impact:

  • Active migrations may be disrupted during maintenance
  • No configuration or migration workflows will be serviced until resolved

Cause

Several factors can contribute to these issues:

PostgresDB Issues:

  • The autovacuum process fails to start at boot due to a timing condition in versions earlier than 4.3.3
  • Table index cleanup and disk space management fail
  • Deployments with high DB usage may run out of partition space, particularly with a large, constantly changing vCenter Server inventory or a large number of migration workflows executed in a short period

Storage Issues:

  • The file system fills up from previous failed upgrade attempts
  • Upgrade downloads are left partial due to disruptions
  • Upgrade bundles fail to extract due to corruption
  • Insufficient space in /common/upgrade/*
  • Old log files are not purged from /common/logs/admin/* or /common/logs/upgrade/*
  • Large log files accumulate in /common/logs/admin/*
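
To see which of the directories named above is consuming the most space, a short sketch such as the following may help. The function name report_dirs is hypothetical; it relies only on the standard du and sort utilities and silently skips directories that do not exist.

```shell
#!/bin/sh
# Sketch only: report_dirs is an illustrative name, not part of HCX.

# report_dirs prints "<kilobytes> <path>" for each existing directory
# passed as an argument, largest first.
report_dirs() {
    for dir in "$@"; do
        [ -d "$dir" ] && du -sk "$dir" 2>/dev/null
    done | sort -rn
}

# Example on the appliance, using the paths listed in this article:
#   report_dirs /common/upgrade /common/logs/admin /common/logs/upgrade
```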

Check the following logs in the /common/logs/admin directory:

app-startup.log*
web-startup.log*

Also, run the following command in that directory to check the size of the other logs. When following the resolution, remove the ones that are both large and old.

ls -l
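
To bring the largest files to the top of the listing, ls can sort by size. This is a small sketch under the assumption that the appliance's ls supports the -S flag (sort by size, largest first); the function name largest_files is hypothetical. It only lists files and does not delete anything.

```shell
#!/bin/sh
# Sketch only: largest_files is an illustrative name.

# largest_files lists the N largest regular files in a directory
# (default 10) in long format, largest first. The grep keeps only
# regular-file entries, dropping the "total" line and subdirectories.
largest_files() {
    ls -lS "$1" | grep '^-' | head -n "${2:-10}"
}

# Example on the appliance:
#   largest_files /common/logs/admin 5
```

The modification date shown in each long-format line indicates whether a large file is also an old one.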

Log Rotation Issues:

  • In HCX 4.5.0, the zookeeper.out logging level changed from INFO to DEBUG due to system requirements
  • This change produced a very high volume of log entries in zookeeper.out, and the file was not being rotated and compressed
  • The unrotated log grows gradually (it can take months to reach its maximum size)
  • The Kafka default broker configuration retained messages related to topics
  • These messages were not cleared by the underlying utility, which caused high utilization of the /common directory

Resolution

Short-term Measures:

  • For upgrade issues caused by /tmp, restart the HCX Manager/Connector appliance to clean up the /tmp directory.
  • For upgrade issues caused by a large /common/logs/zookeeper/zookeeper.out file, restart the HCX Manager/Connector appliance; the file is cleaned up on startup.
  • For additional cleanup, open a support case with Broadcom Support and refer to this KB article. For more information, see Creating and managing Broadcom support cases.

Long-term Measures:

Upgrade to the latest version (4.10.3+), which contains all previous fixes, including:

  • PostgresDB autovacuum issues fixed in version 4.3.3+
  • Kafka-db retention policies improved and fixed in version 4.7.0+
  • Unexpected growth of the zookeeper.out log file under the "/common/logs/zookeeper" directory improved and fixed in version 4.8.0+
  • Reduced verbosity of startup logs in version 4.10.2+
  • Improved log rotation configuration in versions 4.10.3+ and 4.11+
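
As an illustration of how the version thresholds above apply to a given deployment, the following sketch prints which of the listed fixes a version already includes. The function names version_ge and fixes_included are hypothetical, the version string must be supplied by the operator, and the comparison assumes a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Sketch only: version_ge and fixes_included are illustrative names.

# version_ge succeeds when version $1 is greater than or equal to $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# fixes_included prints the fixes from the list in this article that the
# given HCX version already contains.
fixes_included() {
    current="$1"
    while IFS='|' read -r min_version description; do
        if version_ge "$current" "$min_version"; then
            echo "$min_version: $description"
        fi
    done <<'EOF'
4.3.3|PostgresDB autovacuum fix
4.7.0|Kafka-db retention policy improvements
4.8.0|zookeeper.out growth fix
4.10.2|Reduced startup log verbosity
4.10.3|Improved log rotation configuration
EOF
}

# Example: fixes_included 4.8.0
```

A version sort is used because a plain string comparison would rank 4.10.2 below 4.8.0.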

Additional Information