Failed to start HCX "appliance-management" service.

Article ID: 373583

Products

VMware HCX

Issue/Introduction

This article explains what may cause the HCX appliance-management service to fail to start.

Symptoms:

The HCX Manager UI (port 443) and the Appliance Management UI (port 9443) do not load.

Investigation shows an issue with the underlying HCX services: when the services are stopped and started in the correct order, the appliance-management service fails to start.

The /common partition shows only 32% usage, so a full disk is not the cause:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        5.9G     0  5.9G   0% /dev
tmpfs           5.9G     0  5.9G   0% /dev/shm
tmpfs           5.9G  716K  5.9G   1% /run
tmpfs           5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/sda2       7.9G  4.5G  3.1G  60% /
/dev/sda1       237M   63M  161M  29% /recovery
/dev/sda6        44G   14G   29G  32% /common
/dev/sda3       7.9G  4.4G  3.1G  59% /slot2
tmpfs           1.2G     0  1.2G   0% /run/user/1000

 

Use "admin" credentials to SSH into the HCX Connector or Cloud Manager and change user to "root".
Stop all the services as shown below.

# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
# systemctl stop postgresdb

Start all the services in the sequence below.

# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management
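
Optionally, a quick loop like the following (standard systemctl usage, not HCX-specific tooling) confirms whether each service reports active after the restart:

# for s in postgresdb zookeeper kafka app-engine web-engine appliance-management; do echo -n "$s: "; systemctl is-active $s; done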

 

The appliance-management service fails to start:

Job for appliance-management.service failed because a timeout was exceeded.
See "systemctl status appliance-management.service" and "journalctl -qxe" for details.

journalctl -qxe:

sonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( val,'{enterprise}'::text[],'"HybridityAdmin"'::jsonb),'{organization}'::text[],'"HybridityAdmin"'::jsonb),'{username}'::text[],'"HybridityAdmin"'::jsonb),'{userRoles}'::text[],'["System Administrator"]'::jsonb),'{transactionId}'::text[],'""'::jsonb),'{jobId}'::text[],'"########-####-####-884f8ab0c7e9"'::jsonb),'{jobType}'::text[],'"NetworkStretchJobs"'::jsonb),'{workflowType}'::text[],'"SyncDestinationSiteInfoJob"'::jsonb),'{state}'::text[],'"BEGIN"'::jsonb),'{previousState}'::text[],'"UNDEFINED_STATE"'::jsonb),'{recoverable}'::text[],'false'::jsonb),'{isQueued}'::text[],'true'::jsonb),'{isCancelled}'::text[],'false'::jsonb),'{isPaused}'::text[],'false'::jsonb),'{isRolledBack}'::text[],'false'::jsonb),'{isRollingBack}'::text[],'false'::jsonb),'{version}'::text[],'"1.0"'::jsonb),'{createTimeEpoch}'::text[],'1690310997461'::jsonb),'{absoluteExpireTimeEpoch}'::text[],'0'::jsonb),'{startTime}'::text[],'1690310997461'::jsonb),'{endTime}'::text[],'0'::jsonb) ,'{startDelayInSeconds}'::text[],'0.0'::jsonb),'{percentComplete}'::text[],'0'::jsonb),'{isDone}'::text[],'false'::jsonb),'{didFail}'::text[],'false'::jsonb),'{legId}'::text[],'"1"'::jsonb),'{originLegId}'::text[],'"1"'::jsonb),'{jobClass}'::text[],'"com.vmware.vchs.hybridity.messaging.adapter.JobProducerAdapter$1"'::jsonb),'{timeToExecute}'::text[],'1721764800036'::jsonb),'{service}'::text[],'"UNDEFINED_SERVICE"'::jsonb),'{userRealmId}'::text[],
'"########-####-####-#####-8368ae878709"'::jsonb),'{parentLegId}'::text[],'"1"'::jsonb),'{rowType}'::text[],'"JOB_ROW"'::jsonb) || '{"lastUpdated":"2024-07-23T20:00:00.036Z","lastUpdateOrganization":"HybridityAdmin","lastUpdateUser":"HybridityAdmin","lastUpdateEnterprise":"HybridityAdmin"}' where ((val ->>'jobId') = '########-####-####-####-884f8ab0c7e9'  AND (val ->>'rowType') = 'JOB_ROW') ERROR: database is not accepting commands to avoid wraparound data loss in database "hybridity"
  Hint: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
2024-07-23 20:00:00.040 UTC [QuartzScheduler_Worker-2, , , TxId: ] ERROR c.v.vchs.hybridity.messaging.Job- Exception java.lang.RuntimeException: Error queuing Job: Workflow TechSupportServiceJob/TECHSUPPORT_CLEANUP job (########-####-####-0e6fdba854d1 ) State:INITIATED PrevState:UNDEFINED_STATE called by Service:UNDEFINED_SERVICE

 

-  The log messages below show that the web-engine is not running; however, the status of the web-engine service (shown further below) indicates that it is "active".

# tail -f messages

2024-07-23T19:22:14.905+00:00 #################### su[1582]: Successful su for root by admin
2024-07-23T19:22:14.908+00:00 #################### su[1582]: + /dev/pts/1 admin:root
2024-07-23T19:22:14.908+00:00 #################### su[1582]: pam_unix(su:session): session opened for user root by admin(uid=1000)
2024-07-23T19:22:20.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:30.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:37.013+00:00 #################### systemd[1]: appliance-management.service: Start-pre operation timed out. Terminating.
2024-07-23T19:22:37.014+00:00 #################### systemd[1]: appliance-management.service: Failed with result 'timeout'.
2024-07-23T19:22:37.015+00:00 #################### systemd[1]: Failed to start Appliance Management.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Service RestartSec=100ms expired, scheduling restart.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Scheduled restart job, restart counter is at 1.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: Stopped Appliance Management.
2024-07-23T19:22:37.265+00:00 #################### systemd[1]: Starting Appliance Management...
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: localhost:5432 - accepting connections
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: postgresdb is running.
2024-07-23T19:22:37.287+00:00 #################### service-dependency-check.sh[1674]: zookeeper is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: kafka is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: app-engine is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.
2024-07-23T19:22:38.477+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:40.592+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:50.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:00.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:07.913+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.

2024-07-23T19:23:08.490+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:10.594+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:20.595+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"

 

-  Verified that the web-engine service becomes and remains "active (running)":

# systemctl status web-engine

● web-engine.service - WebEngine
   Loaded: loaded (/etc/systemd/system/web-engine.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2024-07-23 21:20:12 UTC; 8min ago
  Process: 17769 ExecStop=/bin/sh -a -c rm -f /var/run/admin/web-engine.date (code=exited, status=0/SUCCESS)
  Process: 22303 ExecStartPre=/bin/sh -a -c for i in `seq 1 60`; do if netstat -ntlp 2>&1 | grep -q :8443 ; then sleep 2; continue; else exit 0; fi; done; exit 1 (code=exited, status=0/SUCCESS)
  Process: 22268 ExecStartPre=/etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka app-engine (code=exited, status=0/SUCCESS)
 Main PID: 22309 (sh)
    Tasks: 29 (limit: 4915)
   Memory: 901.6M
   CGroup: /system.slice/web-engine.service
           ├─22309 /bin/sh -a -c source /opt/vmware/deploy/xstream/get-xstream-whitelist.sh; /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMO>
           └─22316 /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:LogFile= -Dfile.encoding=UTF-8 -DuseDR2CbasedMigrateToVCA=false >

Jul 23 21:20:11 #################### systemd[1]: Starting WebEngine...
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: localhost:5432 - accepting connections
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: postgresdb is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: database-upgrade is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: zookeeper is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: kafka is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: app-engine is running.
Jul 23 21:20:12 #################### systemd[1]: Started WebEngine.


-  Verified that the appliance-management service remains in "activating (start-pre)" and never starts: its ExecStartPre dependency check (service-dependency-check.sh, visible in the CGroup below) keeps waiting for web-engine and eventually times out.

# systemctl status appliance-management

● appliance-management.service - Appliance Management
   Loaded: loaded (/etc/systemd/system/appliance-management.service; enabled; vendor preset: enabled)
   Active: activating (start-pre) since Tue 2024-07-23 21:20:21 UTC; 8min ago
Cntrl PID: 22366 (service-depende)
    Tasks: 2 (limit: 4915)
   Memory: 476.0K
   CGroup: /system.slice/appliance-management.service
           ├─22366 /bin/bash /etc/systemd/service-dependency-check.sh ignore-service-failures postgresdb zookeeper kafka app-engine web-engine
           └─23891 sleep 30

 

-  At this point, the database no longer accepts updates or deletes on its tables. Even a VACUUM cannot complete, whether run automatically or manually:

hybridity=# VACUUM FULL "Job";
ERROR:  database is not accepting commands to avoid wraparound data loss in database "hybridity"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
hybridity=#
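
This error is PostgreSQL's transaction ID (XID) wraparound protection: when the oldest unfrozen transaction ID in a database approaches the roughly two-billion-transaction limit, PostgreSQL stops accepting commands until the database is vacuumed. As a diagnostic only (standard PostgreSQL, run from the same psql session shown above), the age of each database's oldest transaction ID can be checked with:

hybridity=# SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

An xid_age close to 2,000,000,000 for "hybridity" confirms the wraparound condition.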

 

 

Cause

The HCX appliance-management service was unable to start due to database corruption; the embedded PostgreSQL database "hybridity" entered transaction ID wraparound protection and stopped accepting commands.

 

Resolution

If you experience the above behavior, open a Service Request with VMware by Broadcom Global Support Services and provide the following information:

- The affected HCX-MGR support bundle, including a DB dump.
- Whether there have been any power outages or storage issues in the environment.
- When the issue was first experienced, and what actions were being performed at the time.

Alternatively, if this is a new deployment with no active migrations or Network Extensions, redeploying the affected HCX-MGR will resolve this behavior.