VCF Operations Fleet Management is not Ready - failed to start VMware Postgres database server

Article ID: 412351

Products

VCF Operations

Issue/Introduction

  • After applying the 9.0.1 patch to the VMware Cloud Foundation (VCF) Operations Fleet Management appliance, the appliance may fail to initialize correctly following a reboot.

  • This issue may occur during manual restarts or during reboots triggered by a vSphere High Availability (HA) host isolation response.

  • The failure is primarily caused by incorrect permissions on the PostgreSQL data directory, which prevents the database service from starting.

Symptoms:

  • UI Error: The Fleet Management > Lifecycle section displays the message: VCF Operations Fleet Management is not Ready.

  • Service Failure: The PostgreSQL service (vpostgres.service) fails to start after the appliance reboots.

  • Log Errors: Running journalctl -xeu vpostgres.service reveals a fatal error:

    <hostname> postgres[11273]: <TimeStamp> [11273] FATAL:  data directory "/var/vmware/vpostgres/current/pgdata" has invalid permissions
    <hostname> postgres[11273]: <TimeStamp> [11273] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

  • PID File Errors: The system journal also contains "Permission denied" errors for the PostgreSQL process ID (PID) file, which is located at /var/vmware/vpostgres/current/pgdata/postmaster.pid.

  • Certificate Change: The VCF Operations fleet management certificate may be unexpectedly regenerated.

  • Application Log Errors: The following entries appear in /var/log/vrlcm/vmware_vrlcm.log on the Fleet Management appliance:

    Caused by: org.hibernate.exception.JDBCConnectionException: Unable to open JDBC Connection for DDL execution
            at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:112)
    Caused by: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
            at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:319)
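
To confirm the permission symptom directly instead of inferring it from the journal, the data directory mode can be inspected. The following is a minimal sketch, not a supported tool. The helper names pgdata_mode_ok and dir_mode and the PGDATA_DIR variable are hypothetical, and the script assumes GNU stat is available on the appliance. It accepts only the two modes named in the DETAIL log line above.

```shell
#!/bin/sh
# Read-only check: reports the mode of the PostgreSQL data directory.
# PGDATA_DIR is a hypothetical override; the default path is the one from the log above.
PGDATA_DIR="${PGDATA_DIR:-/var/vmware/vpostgres/current/pgdata}"

# Hypothetical helper: succeeds only for the two modes named in the DETAIL line.
pgdata_mode_ok() {
    case "$1" in
        700|750) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: prints the octal mode of a path (assumes GNU stat).
dir_mode() {
    stat -c '%a' "$1"
}

if [ -d "$PGDATA_DIR" ]; then
    mode="$(dir_mode "$PGDATA_DIR")"
    if pgdata_mode_ok "$mode"; then
        echo "OK: $PGDATA_DIR has mode $mode"
    else
        echo "PROBLEM: $PGDATA_DIR has mode $mode (expected 700 or 750)"
    fi
else
    echo "SKIP: $PGDATA_DIR not found on this system"
fi
```

A PROBLEM result matches the FATAL entry shown in the journal output. The script changes nothing on the appliance.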

Environment

VCF Operations Fleet Management Appliance 9.0.1

Resolution

Note: Before implementing the resolution steps, take a snapshot of the VCF Operations Fleet Management appliance. For detailed instructions, refer to "Managing snapshots in vSphere Web Client".

  1. Set the appropriate permissions for the PostgreSQL data directory by running:
    • chmod 700 /var/vmware/vpostgres/current/pgdata/
  2. Restore the backup key and certificate in the Lifecycle Manager certificate directory, /opt/vmware/vlcm/cert:
    • Locate the backup key and certificate files, which have a timestamp appended to their names (e.g. server.crt.<timestamp>).
    • Rename them to remove the timestamp suffix:
      • mv server.key.<timestamp> server.key
      • mv server.crt.<timestamp> server.crt
  3. Disable the cap_init service and reload the systemd manager configuration:
    • systemctl disable cap_init
    • systemctl daemon-reload
  4. Restart the Nginx and Lifecycle Manager services to apply the changes:
    • systemctl restart nginx
    • systemctl restart vrlcm-server.service
  5. Allow a few minutes for the Lifecycle Manager service to initialize. To check the service status:
    • systemctl status vrlcm-server.service
  6. To ensure the service has initialized successfully, review the startup logs. Allow several minutes for this sequence to complete:
    • tail -f /var/log/vrlcm/vmware_vrlcm.log
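
Before renaming the certificate files in step 2, it can help to preview the commands. The following dry-run sketch only prints the mv commands it would propose and executes none of them. The helper names strip_last_suffix and plan_renames and the CERT_DIR variable are hypothetical. The script assumes the timestamp suffix contains no dots, which this article does not confirm. If several backups exist, it lists all of them and the choice is left to the operator.

```shell
#!/bin/sh
# Dry run: prints proposed renames for timestamped backups; executes nothing.
# CERT_DIR is a hypothetical override; the default path is the one from step 2.
CERT_DIR="${CERT_DIR:-/opt/vmware/vlcm/cert}"

# Hypothetical helper: drop everything after the last dot,
# e.g. server.key.<timestamp> -> server.key (assumes a dot-free timestamp).
strip_last_suffix() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: print one proposed mv command per backup file found.
plan_renames() {
    dir="$1"
    for f in "$dir"/server.key.* "$dir"/server.crt.*; do
        [ -e "$f" ] || continue   # skip glob patterns that matched nothing
        base="$(basename "$f")"
        echo "mv $f $dir/$(strip_last_suffix "$base")"
    done
}

if [ -d "$CERT_DIR" ]; then
    plan_renames "$CERT_DIR"
else
    echo "SKIP: $CERT_DIR not found on this system"
fi
```

Review the printed commands against the files in the directory before running any of them by hand.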

Note: A reboot of the SDDC Manager appliance may be required if the services do not reconnect cleanly. In this state, the user interface may remain visible but will not function as expected.
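
Steps 5 and 6 involve waiting for the services to come up. As an alternative to re-running the status command by hand, a small polling helper can be used. This is a sketch only: wait_for is a hypothetical function name, and the attempt count and delay in the usage note are arbitrary assumptions, not documented values.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds
# or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command> [args...]
wait_for() {
    attempts="$1"
    delay="$2"
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0              # command succeeded
        fi
        i=$((i + 1))
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"        # pause before the next attempt
        fi
    done
    return 1                      # never succeeded
}
```

On the appliance, for example, `wait_for 30 10 systemctl is-active --quiet vrlcm-server.service` polls for up to about five minutes and returns 0 once the unit is active. The same call with vpostgres.service confirms that the database service started.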