VCF Operations Fleet Management is not Ready - failed to start VMware Postgres database server

Products

VCF Operations

Issue/Introduction

After applying the 9.0.1 patch to the VCF Operations fleet management appliance, you may experience the following symptoms, but only after the appliance has been rebooted:

In the VCF Operations UI, the Fleet Management >> Lifecycle section displays:
VCF Operations Fleet Management is not Ready
The PostgreSQL service fails to start after rebooting the appliance.

Running journalctl -xeu vpostgres.service as root on the fleet management appliance displays:

<TimeStamp> <hostname> postgres[11273]: <TimeStamp> [11273] FATAL:  data directory "/var/vmware/vpostgres/current/pgdata" has invalid permissions
<TimeStamp> <hostname> postgres[11273]: <TimeStamp> [11273] DETAIL:  Permissions should be u=rwx (0700) or u=rwx,g=rx (0750).

The system journal contains "Permission denied" errors related to the PostgreSQL process ID (PID) file, which is located at /var/vmware/vpostgres/current/pgdata/postmaster.pid.
The VCF Operations fleet management certificate is regenerated.

This issue is also observed if the fleet management appliance is rebooted as part of a vCenter HA host isolation response.

You will see log entries similar to the following:

/var/log/vrlcm/vmware_vrlcm.log

Caused by: org.hibernate.exception.JDBCConnectionException: Unable to open JDBC Connection for DDL execution
        at org.hibernate.exception.internal.SQLStateConversionDelegate.convert(SQLStateConversionDelegate.java:112)
Caused by: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:319)

journalctl -xeu vpostgres.service

<Hostname> postgres[20638]: pg_ctl: could not open PID file "/var/vmware/vpostgres/current/pgdata/postmaster.pid": Permission denied
<Hostname> systemd[1]: vpostgres.service: Control process exited, code=exited, status=1/FAILURE
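The two journal signatures above can be checked for in one pass. The following is a minimal sketch, not part of the official procedure: the function name has_pgdata_permission_error is hypothetical, and the match strings are taken from the log excerpts in this article.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and succeed if either
# permission-related signature quoted in this article is present.
has_pgdata_permission_error() {
    grep -Eq 'has invalid permissions|could not open PID file .*Permission denied'
}
```

On the appliance, it could be fed from the journal, for example: journalctl -u vpostgres.service --no-pager | has_pgdata_permission_error && echo "matches this article"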

Environment

VCF Operations Fleet Management Appliance 9.0.1

Resolution

Note: Before proceeding, take a snapshot of the VCF Operations Fleet Management appliance, as described in the KB article Managing snapshots in vSphere Web Client.

Procedure

Apply the correct permissions to the pgdata folder by executing the following command:
```
chmod 700 /var/vmware/vpostgres/current/pgdata/
```
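To confirm the change took effect, the directory mode can be read back. This is a minimal sketch assuming GNU stat is available on the appliance. The helper name pgdata_mode_ok is hypothetical, and the accepted modes (0700 and 0750) come from the PostgreSQL error message quoted above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory mode is 700 or 750,
# the two values named in the PostgreSQL error message.
pgdata_mode_ok() {
    # stat -c '%a' prints the octal permission bits (GNU coreutils).
    mode=$(stat -c '%a' "$1") || return 2
    case "$mode" in
        700|750) return 0 ;;
        *) echo "unexpected mode $mode on $1" >&2; return 1 ;;
    esac
}
```

Example use on the appliance: pgdata_mode_ok /var/vmware/vpostgres/current/pgdata && echo "permissions OK"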
Navigate to the /opt/vmware/vlcm/cert directory. The key and certificate files requiring change will have a timestamp in their names (e.g., server.crt.250###2056).

Run the following commands to move the timestamped files into place, replacing the filenames with the ones in your directory:
```
mv server.key.250###2056 server.key
mv server.crt.250###2056 server.crt
```
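Because the timestamp suffix differs per appliance, it can help to list the candidate files before running the mv commands. The sketch below is illustrative only: list_timestamped_certs is a hypothetical helper, and it assumes the files follow the server.key.<suffix> and server.crt.<suffix> naming shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the names of timestamped key and certificate
# files in the given directory, one per line. Prints nothing if none exist.
list_timestamped_certs() {
    dir=$1
    for f in "$dir"/server.key.* "$dir"/server.crt.*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        if [ -e "$f" ]; then
            printf '%s\n' "${f##*/}"
        fi
    done
    return 0
}
```

Example use on the appliance: list_timestamped_certs /opt/vmware/vlcm/cert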
Disable the "cap_init" service by executing the following commands:
```
systemctl disable cap_init
systemctl daemon-reload
```
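The disabled state of cap_init can be verified with systemctl is-enabled. This is a minimal sketch: the helper name unit_is_disabled is hypothetical, and it assumes the unit reports the plain state "disabled" rather than another state such as "masked".

```shell
#!/bin/sh
# Hypothetical helper: succeed only if systemd reports the unit as "disabled".
unit_is_disabled() {
    # is-enabled exits non-zero for disabled units, so ignore its exit code
    # and compare the printed state instead.
    state=$(systemctl is-enabled "$1" 2>/dev/null) || true
    [ "$state" = "disabled" ]
}
```

Example use on the appliance: unit_is_disabled cap_init && echo "cap_init is disabled"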
Restart the Nginx service:
```
systemctl restart nginx
```
Restart the Lifecycle Manager service:
```
systemctl restart vrlcm-server.service
```
Wait a couple of minutes for the service to initialize. You can check its status with the command below:
```
systemctl status vrlcm-server.service
```
Monitor the service startup log to confirm it is fully operational. This process may take several minutes.
```
tail -f /var/log/vrlcm/vmware_vrlcm.log
```
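Instead of re-running the status command by hand, the wait can be scripted. The following is a sketch under stated assumptions: wait_for_active is a hypothetical helper, and the 300-second timeout and 10-second interval are illustrative defaults, not values from this article. A unit that systemd reports as active may still be initializing, so the log above remains the authoritative check.

```shell
#!/bin/sh
# Hypothetical helper: poll a systemd unit until it is active or a timeout
# expires. Arguments: unit name, timeout in seconds, poll interval in seconds.
wait_for_active() {
    unit=$1
    timeout=${2:-300}
    interval=${3:-10}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active after ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$unit is not active after ${timeout}s" >&2
    return 1
}
```

Example use on the appliance: wait_for_active vrlcm-server.service 300 10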
NOTE: If the services do not reconnect cleanly, a reboot of the SDDC Manager appliance may be required. In this state, the UI will likely load but will not function as expected.