This document explains what may cause the HCX Appliance-Management service to fail to start.
Symptoms:
The HCX Manager UI (port 443) and the Appliance UI (port 9443) do not load.
Investigation points to an issue with the underlying HCX services: when the services are stopped and started in the correct order, the Appliance-Management service fails to start.
The /common partition is at 32% usage, so it is not full.
# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 5.9G 0 5.9G 0% /dev
tmpfs 5.9G 0 5.9G 0% /dev/shm
tmpfs 5.9G 716K 5.9G 1% /run
tmpfs 5.9G 0 5.9G 0% /sys/fs/cgroup
/dev/sda2 7.9G 4.5G 3.1G 60% /
/dev/sda1 237M 63M 161M 29% /recovery
/dev/sda6 44G 14G 29G 32% /common
/dev/sda3 7.9G 4.4G 3.1G 59% /slot2
tmpfs 1.2G 0 1.2G 0% /run/user/1000
Use "admin" credentials to SSH into the HCX Connector or Cloud Manager and change user to "root".
Stop all the services as shown below.
# systemctl stop zookeeper
# systemctl stop kafka
# systemctl stop app-engine
# systemctl stop web-engine
# systemctl stop appliance-management
# systemctl stop postgresdb
Start all services as per the sequence below.
# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management
The Appliance-Management service fails to start:
Job for appliance-management.service failed because a timeout was exceeded.
See "systemctl status appliance-management.service" and "journalctl -qxe" for details.
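The stop and start sequence above can be scripted so the order is not mistyped. This is a minimal sketch, not a VMware-provided tool: the function name `restart_hcx_services` and the `SYSTEMCTL`, `STOP_ORDER`, and `START_ORDER` variables are illustrative assumptions; only the service names and their order come from this article.

```shell
#!/bin/sh
# Sketch: restart the HCX services in the order documented in this article.
# SYSTEMCTL is a hypothetical override for dry runs (for example SYSTEMCTL=echo);
# it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Stop order and start order exactly as listed above.
STOP_ORDER="zookeeper kafka app-engine web-engine appliance-management postgresdb"
START_ORDER="postgresdb zookeeper kafka app-engine web-engine appliance-management"

restart_hcx_services() {
    for svc in $STOP_ORDER; do
        $SYSTEMCTL stop "$svc"
    done
    for svc in $START_ORDER; do
        # Abort at the first service that fails to start so the failure is visible.
        $SYSTEMCTL start "$svc" || { echo "failed to start $svc" >&2; return 1; }
    done
}
```

Setting `SYSTEMCTL=echo` before calling `restart_hcx_services` prints the twelve commands instead of running them, which is a safe way to review the sequence first.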
journalctl -qxe:
sonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( val,'{enterprise}'::text[],'"HybridityAdmin"'::jsonb),'{organization}'::text[],'"HybridityAdmin"'::jsonb),'{username}'::text[],'"HybridityAdmin"'::jsonb),'{userRoles}'::text[],'["System Administrator"]'::jsonb),'{transactionId}'::text[],'""'::jsonb),'{jobId}'::text[],'"########-####-####-884f8ab0c7e9"'::jsonb),'{jobType}'::text[],'"NetworkStretchJobs"'::jsonb),'{workflowType}'::text[],'"SyncDestinationSiteInfoJob"'::jsonb),'{state}'::text[],'"BEGIN"'::jsonb),'{previousState}'::text[],'"UNDEFINED_STATE"'::jsonb),'{recoverable}'::text[],'false'::jsonb),'{isQueued}'::text[],'true'::jsonb),'{isCancelled}'::text[],'false'::jsonb),'{isPaused}'::text[],'false'::jsonb),'{isRolledBack}'::text[],'false'::jsonb),'{isRollingBack}'::text[],'false'::jsonb),'{version}'::text[],'"1.0"'::jsonb),'{createTimeEpoch}'::text[],'1690310997461'::jsonb),'{absoluteExpireTimeEpoch}'::text[],'0'::jsonb),'{startTime}'::text[],'1690310997461'::jsonb),'{endTime}'::text[],'0'::jsonb) 
,'{startDelayInSeconds}'::text[],'0.0'::jsonb),'{percentComplete}'::text[],'0'::jsonb),'{isDone}'::text[],'false'::jsonb),'{didFail}'::text[],'false'::jsonb),'{legId}'::text[],'"1"'::jsonb),'{originLegId}'::text[],'"1"'::jsonb),'{jobClass}'::text[],'"com.vmware.vchs.hybridity.messaging.adapter.JobProducerAdapter$1"'::jsonb),'{timeToExecute}'::text[],'1721764800036'::jsonb),'{service}'::text[],'"UNDEFINED_SERVICE"'::jsonb),'{userRealmId}'::text[],
'"########-####-####-#####-8368ae878709"'::jsonb),'{parentLegId}'::text[],'"1"'::jsonb),'{rowType}'::text[],'"JOB_ROW"'::jsonb) || '{"lastUpdated":"2024-07-23T20:00:00.036Z","lastUpdateOrganization":"HybridityAdmin","lastUpdateUser":"HybridityAdmin","lastUpdateEnterprise":"HybridityAdmin"}' where ((val ->>'jobId') = '########-####-####-####-884f8ab0c7e9' AND (val ->>'rowType') = 'JOB_ROW') ERROR: database is not accepting commands to avoid wraparound data loss in database "hybridity"
Hint: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
2024-07-23 20:00:00.040 UTC [QuartzScheduler_Worker-2, , , TxId: ] ERROR c.v.vchs.hybridity.messaging.Job- Exception java.lang.RuntimeException: Error queuing Job: Workflow TechSupportServiceJob/TECHSUPPORT_CLEANUP job (########-####-####-0e6fdba854d1 ) State:INITIATED PrevState:UNDEFINED_STATE called by Service:UNDEFINED_SERVICE
- The log messages below show that web-engine is not running; however, the status of the web-engine service indicates that it is "Active".
# tail -f messages
2024-07-23T19:22:14.905+00:00 #################### su[1582]: Successful su for root by admin
2024-07-23T19:22:14.908+00:00 #################### su[1582]: + /dev/pts/1 admin:root
2024-07-23T19:22:14.908+00:00 #################### su[1582]: pam_unix(su:session): session opened for user root by admin(uid=1000)
2024-07-23T19:22:20.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:30.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:37.013+00:00 #################### systemd[1]: appliance-management.service: Start-pre operation timed out. Terminating.
2024-07-23T19:22:37.014+00:00 #################### systemd[1]: appliance-management.service: Failed with result 'timeout'.
2024-07-23T19:22:37.015+00:00 #################### systemd[1]: Failed to start Appliance Management.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Service RestartSec=100ms expired, scheduling restart.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Scheduled restart job, restart counter is at 1.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: Stopped Appliance Management.
2024-07-23T19:22:37.265+00:00 #################### systemd[1]: Starting Appliance Management...
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: localhost:5432 - accepting connections
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: postgresdb is running.
2024-07-23T19:22:37.287+00:00 #################### service-dependency-check.sh[1674]: zookeeper is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: kafka is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: app-engine is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.
2024-07-23T19:22:38.477+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:40.592+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:50.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:00.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:07.913+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.
2024-07-23T19:23:08.490+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:10.594+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:20.595+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
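To confirm quickly that a log is dominated by this error, the matching lines can be counted. The sketch below is illustrative only: `count_wraparound_errors` and `WRAPAROUND_MSG` are hypothetical names, and the log path is passed in by the caller because the full path of the `messages` file is not stated in this article.

```shell
#!/bin/sh
# Sketch: count PostgreSQL wraparound-protection errors in a log file.
# The message text is copied from the log excerpts in this article.
WRAPAROUND_MSG='database is not accepting commands to avoid wraparound data loss'

count_wraparound_errors() {
    # $1: path to the log file to scan (for example the "messages" file tailed above).
    # -F treats the message as a fixed string; -c prints the number of matching lines.
    grep -c -F -- "$WRAPAROUND_MSG" "$1"
}
```

A non-zero count alongside the appliance-management start-pre timeout matches the behavior described here.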
- Verified that the web-engine service becomes and remains "active (running)".
# systemctl status web-engine
● web-engine.service - WebEngine
Loaded: loaded (/etc/systemd/system/web-engine.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2024-07-23 21:20:12 UTC; 8min ago
Process: 17769 ExecStop=/bin/sh -a -c rm -f /var/run/admin/web-engine.date (code=exited, status=0/SUCCESS)
Process: 22303 ExecStartPre=/bin/sh -a -c for i in `seq 1 60`; do if netstat -ntlp 2>&1 | grep -q :8443 ; then sleep 2; continue; else exit 0; fi; done; exit 1 (code=exited, status=0/SUCCESS)
Process: 22268 ExecStartPre=/etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka app-engine (code=exited, status=0/SUCCESS)
Main PID: 22309 (sh)
Tasks: 29 (limit: 4915)
Memory: 901.6M
CGroup: /system.slice/web-engine.service
├─22309 /bin/sh -a -c source /opt/vmware/deploy/xstream/get-xstream-whitelist.sh; /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMO>
└─22316 /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:LogFile= -Dfile.encoding=UTF-8 -DuseDR2CbasedMigrateToVCA=false >
Jul 23 21:20:11 #################### systemd[1]: Starting WebEngine...
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: localhost:5432 - accepting connections
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: postgresdb is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: database-upgrade is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: zookeeper is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: kafka is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: app-engine is running.
Jul 23 21:20:12 #################### systemd[1]: Started WebEngine.
- Verified that the appliance-management service is stuck in "activating (start-pre)" and is not running.
# systemctl status appliance-management
● appliance-management.service - Appliance Management
Loaded: loaded (/etc/systemd/system/appliance-management.service; enabled; vendor preset: enabled)
Active: activating (start-pre) since Tue 2024-07-23 21:20:21 UTC; 8min ago
Cntrl PID: 22366 (service-depende)
Tasks: 2 (limit: 4915)
Memory: 476.0K
CGroup: /system.slice/appliance-management.service
├─22366 /bin/bash /etc/systemd/service-dependency-check.sh ignore-service-failures postgresdb zookeeper kafka app-engine web-engine
└─23891 sleep 30
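The two status checks above can be generalized to all six services at once. This is a rough sketch under stated assumptions: `report_service_states`, `HCX_SERVICES`, and the `SYSTEMCTL` override are hypothetical names; `systemctl is-active` is the standard systemd subcommand that prints a unit's state (for example `active` or `activating`).

```shell
#!/bin/sh
# Sketch: print the systemd state of each HCX service named in this article.
# SYSTEMCTL is a hypothetical override for testing; it defaults to systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"
HCX_SERVICES="postgresdb zookeeper kafka app-engine web-engine appliance-management"

report_service_states() {
    for svc in $HCX_SERVICES; do
        # is-active prints the state and returns non-zero unless the unit is active.
        state="$($SYSTEMCTL is-active "$svc" 2>/dev/null)"
        printf '%-22s %s\n' "$svc" "${state:-unknown}"
    done
}
```

In the situation described here, one would expect `appliance-management` to report `activating` while the other services report `active`.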
- At this point, the database can no longer accept updates or deletes to the tables. Even a vacuum cannot complete, whether it is run automatically or manually.
hybridity=# VACUUM FULL "Job";
ERROR: database is not accepting commands to avoid wraparound data loss in database "hybridity"
HINT: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
hybridity=#
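The error above is PostgreSQL's protection against transaction ID wraparound. How close each database is to that point can be read from `age(datfrozenxid)` in the `pg_database` catalog, and a read-only query like this should normally still run in this state. The sketch below is hedged accordingly: the `PSQL` connection settings (user and database options) are assumptions that must be adjusted for the appliance, and `show_xid_age` and `xid_headroom` are hypothetical helper names. The headroom figure is approximate, since PostgreSQL stops accepting commands some margin before the 2^31 limit.

```shell
#!/bin/sh
# Sketch: report the age of the oldest unfrozen transaction ID per database.
# PSQL is an assumption: adjust the binary path, user, and connection options
# for the appliance. The query itself is standard PostgreSQL and read-only.
PSQL="${PSQL:-psql -U postgres -d hybridity}"

XID_AGE_QUERY='SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;'

show_xid_age() {
    # -A unaligned output, -t rows only, -F field separator, -c run one command.
    $PSQL -At -F ' ' -c "$XID_AGE_QUERY"
}

# Approximate number of transactions left before the 2^31 wraparound point.
xid_headroom() {
    echo $(( 2147483648 - $1 ))
}
```

For example, `xid_headroom 2146483648` prints `1000000`, meaning roughly one million transaction IDs remain. The output is informational only and is worth including in the Service Request.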
HCX Appliance-Management was unable to start due to DB corruption.
If you experience the above behavior, open a Service Request with VMware by Broadcom Global Support Services and provide the below information.
- Affected HCX-MGR support bundle with DB dump.
- Have there been any power outages or storage issues in the environment?
- When was the issue first experienced? Detail what actions were being performed.
Alternatively, if this is a new deployment with no active Migrations or Network Extensions, redeploying the affected HCX-MGR will resolve this behavior.