Failed to start HCX "appliance-management" service.

Article ID: 373583

Products

VMware HCX

Issue/Introduction

This article explains what may cause the HCX appliance-management service to fail to start.

Symptoms:

The HCX Manager UI (port 443) and the Appliance Management UI (port 9443) do not load.

Investigation shows an issue with the underlying HCX services: when the services are stopped and started in the correct order, the appliance-management service fails to start.

The /common partition shows only 32% usage, so a full disk is not the cause:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        5.9G     0  5.9G   0% /dev
tmpfs           5.9G     0  5.9G   0% /dev/shm
tmpfs           5.9G  716K  5.9G   1% /run
tmpfs           5.9G     0  5.9G   0% /sys/fs/cgroup
/dev/sda2       7.9G  4.5G  3.1G  60% /
/dev/sda1       237M   63M  161M  29% /recovery
/dev/sda6        44G   14G   29G  32% /common
/dev/sda3       7.9G  4.4G  3.1G  59% /slot2
tmpfs           1.2G     0  1.2G   0% /run/user/1000

 

Use "admin" credentials to SSH into the HCX Connector or Cloud Manager and change user to "root".
Stop all the services as shown below.

# systemctl stop zookeeper 
# systemctl stop kafka 
# systemctl stop app-engine 
# systemctl stop web-engine 
# systemctl stop appliance-management
# systemctl stop postgresdb

Start all the services in the sequence below.

# systemctl start postgresdb
# systemctl start zookeeper
# systemctl start kafka
# systemctl start app-engine
# systemctl start web-engine
# systemctl start appliance-management
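
Optionally, a quick loop like the following (standard systemctl usage, not HCX-specific tooling) confirms whether each service reports active after the restart:

# for s in postgresdb zookeeper kafka app-engine web-engine appliance-management; do echo -n "$s: "; systemctl is-active $s; done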

 

The appliance-management service fails to start:

Job for appliance-management.service failed because a timeout was exceeded.
See "systemctl status appliance-management.service" and "journalctl -qxe" for details.

journalctl -qxe:

sonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( "jsonb_set_recursive"( val,'{enterprise}'::text[],'"HybridityAdmin"'::jsonb),'{organization}'::text[],'"HybridityAdmin"'::jsonb),'{username}'::text[],'"HybridityAdmin"'::jsonb),'{userRoles}'::text[],'["System Administrator"]'::jsonb),'{transactionId}'::text[],'""'::jsonb),'{jobId}'::text[],'"########-####-####-884f8ab0c7e9"'::jsonb),'{jobType}'::text[],'"NetworkStretchJobs"'::jsonb),'{workflowType}'::text[],'"SyncDestinationSiteInfoJob"'::jsonb),'{state}'::text[],'"BEGIN"'::jsonb),'{previousState}'::text[],'"UNDEFINED_STATE"'::jsonb),'{recoverable}'::text[],'false'::jsonb),'{isQueued}'::text[],'true'::jsonb),'{isCancelled}'::text[],'false'::jsonb),'{isPaused}'::text[],'false'::jsonb),'{isRolledBack}'::text[],'false'::jsonb),'{isRollingBack}'::text[],'false'::jsonb),'{version}'::text[],'"1.0"'::jsonb),'{createTimeEpoch}'::text[],'1690310997461'::jsonb),'{absoluteExpireTimeEpoch}'::text[],'0'::jsonb),'{startTime}'::text[],'1690310997461'::jsonb),'{endTime}'::text[],'0'::jsonb) ,'{startDelayInSeconds}'::text[],'0.0'::jsonb),'{percentComplete}'::text[],'0'::jsonb),'{isDone}'::text[],'false'::jsonb),'{didFail}'::text[],'false'::jsonb),'{legId}'::text[],'"1"'::jsonb),'{originLegId}'::text[],'"1"'::jsonb),'{jobClass}'::text[],'"com.vmware.vchs.hybridity.messaging.adapter.JobProducerAdapter$1"'::jsonb),'{timeToExecute}'::text[],'1721764800036'::jsonb),'{service}'::text[],'"UNDEFINED_SERVICE"'::jsonb),'{userRealmId}'::text[],
'"########-####-####-#####-8368ae878709"'::jsonb),'{parentLegId}'::text[],'"1"'::jsonb),'{rowType}'::text[],'"JOB_ROW"'::jsonb) || '{"lastUpdated":"2024-07-23T20:00:00.036Z","lastUpdateOrganization":"HybridityAdmin","lastUpdateUser":"HybridityAdmin","lastUpdateEnterprise":"HybridityAdmin"}' where ((val ->>'jobId') = '########-####-####-####-884f8ab0c7e9'  AND (val ->>'rowType') = 'JOB_ROW') ERROR: database is not accepting commands to avoid wraparound data loss in database "hybridity"
  Hint: Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
2024-07-23 20:00:00.040 UTC [QuartzScheduler_Worker-2, , , TxId: ] ERROR c.v.vchs.hybridity.messaging.Job- Exception java.lang.RuntimeException: Error queuing Job: Workflow TechSupportServiceJob/TECHSUPPORT_CLEANUP job (########-####-####-0e6fdba854d1 ) State:INITIATED PrevState:UNDEFINED_STATE called by Service:UNDEFINED_SERVICE

 

-  The log messages below show that the web-engine is not running; however, the status of the web-engine service (shown further below) indicates that it is "active".

# tail -f messages

2024-07-23T19:22:14.905+00:00 #################### su[1582]: Successful su for root by admin
2024-07-23T19:22:14.908+00:00 #################### su[1582]: + /dev/pts/1 admin:root
2024-07-23T19:22:14.908+00:00 #################### su[1582]: pam_unix(su:session): session opened for user root by admin(uid=1000)
2024-07-23T19:22:20.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:30.591+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:37.013+00:00 #################### systemd[1]: appliance-management.service: Start-pre operation timed out. Terminating.
2024-07-23T19:22:37.014+00:00 #################### systemd[1]: appliance-management.service: Failed with result 'timeout'.
2024-07-23T19:22:37.015+00:00 #################### systemd[1]: Failed to start Appliance Management.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Service RestartSec=100ms expired, scheduling restart.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: appliance-management.service: Scheduled restart job, restart counter is at 1.
2024-07-23T19:22:37.264+00:00 #################### systemd[1]: Stopped Appliance Management.
2024-07-23T19:22:37.265+00:00 #################### systemd[1]: Starting Appliance Management...
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: localhost:5432 - accepting connections
2024-07-23T19:22:37.282+00:00 #################### service-dependency-check.sh[1674]: postgresdb is running.
2024-07-23T19:22:37.287+00:00 #################### service-dependency-check.sh[1674]: zookeeper is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: kafka is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: app-engine is running.
2024-07-23T19:22:37.911+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.
2024-07-23T19:22:38.477+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:40.592+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:22:50.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:00.593+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:07.913+00:00 #################### service-dependency-check.sh[1674]: web-engine is not running.

2024-07-23T19:23:08.490+00:00 #################### mmd[1793]: [Err-metricManager] : Inserting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:10.594+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"
2024-07-23T19:23:20.595+00:00 #################### mmd[1793]: [Err-ucMetricManager] : Deleting error: pq: database is not accepting commands to avoid wraparound data loss in database "hybridity"

 

-  Verified that the web-engine service becomes and remains "active (running)":

# systemctl status web-engine

● web-engine.service - WebEngine
   Loaded: loaded (/etc/systemd/system/web-engine.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2024-07-23 21:20:12 UTC; 8min ago
  Process: 17769 ExecStop=/bin/sh -a -c rm -f /var/run/admin/web-engine.date (code=exited, status=0/SUCCESS)
  Process: 22303 ExecStartPre=/bin/sh -a -c for i in `seq 1 60`; do if netstat -ntlp 2>&1 | grep -q :8443 ; then sleep 2; continue; else exit 0; fi; done; exit 1 (code=exited, status=0/SUCCESS)
  Process: 22268 ExecStartPre=/etc/systemd/service-dependency-check.sh postgresdb database-upgrade zookeeper kafka app-engine (code=exited, status=0/SUCCESS)
 Main PID: 22309 (sh)
    Tasks: 29 (limit: 4915)
   Memory: 901.6M
   CGroup: /system.slice/web-engine.service
           ├─22309 /bin/sh -a -c source /opt/vmware/deploy/xstream/get-xstream-whitelist.sh; /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMO>
           └─22316 /usr/java/jre/bin/java -Xmx2048m -Xms2048m -XX:MaxPermSize=512m -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput -XX:LogFile= -Dfile.encoding=UTF-8 -DuseDR2CbasedMigrateToVCA=false >

Jul 23 21:20:11 #################### systemd[1]: Starting WebEngine...
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: localhost:5432 - accepting connections
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: postgresdb is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: database-upgrade is running.
Jul 23 21:20:11 #################### service-dependency-check.sh[22268]: zookeeper is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: kafka is running.
Jul 23 21:20:12 #################### service-dependency-check.sh[22268]: app-engine is running.
Jul 23 21:20:12 #################### systemd[1]: Started WebEngine.


-  Verified that the appliance-management service remains in "activating (start-pre)" and never starts: its ExecStartPre dependency check (service-dependency-check.sh, visible in the CGroup below) keeps waiting for web-engine and eventually times out.

# systemctl status appliance-management

● appliance-management.service - Appliance Management
   Loaded: loaded (/etc/systemd/system/appliance-management.service; enabled; vendor preset: enabled)
   Active: activating (start-pre) since Tue 2024-07-23 21:20:21 UTC; 8min ago
Cntrl PID: 22366 (service-depende)
    Tasks: 2 (limit: 4915)
   Memory: 476.0K
   CGroup: /system.slice/appliance-management.service
           ├─22366 /bin/bash /etc/systemd/service-dependency-check.sh ignore-service-failures postgresdb zookeeper kafka app-engine web-engine
           └─23891 sleep 30

 

-  At this point, the database no longer accepts updates or deletes on its tables. Even a VACUUM cannot complete, whether run automatically or manually:

hybridity=# VACUUM FULL "Job";
ERROR:  database is not accepting commands to avoid wraparound data loss in database "hybridity"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
hybridity=#
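
This error is PostgreSQL's transaction ID (XID) wraparound protection: when the oldest unfrozen transaction ID in a database approaches the roughly two-billion-transaction limit, PostgreSQL stops accepting commands until the database is vacuumed. As a diagnostic only (standard PostgreSQL, run from the same psql session shown above), the age of each database's oldest transaction ID can be checked with:

hybridity=# SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database ORDER BY xid_age DESC;

An xid_age close to 2,000,000,000 for "hybridity" confirms the wraparound condition.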

 

 

Cause

The HCX appliance-management service was unable to start due to database corruption; the embedded PostgreSQL database "hybridity" entered transaction ID wraparound protection and stopped accepting commands.

 

Resolution

If you experience the above behavior, open a Service Request with VMware by Broadcom Global Support Services and provide the following information:

- The affected HCX-MGR support bundle, including a DB dump.
- Whether there have been any power outages or storage issues in the environment.
- When the issue was first experienced, and what actions were being performed at the time.

Alternatively, if this is a new deployment with no active migrations or Network Extensions, redeploying the affected HCX-MGR will resolve this behavior.