vPostgres service fails to start on vCenter Server due to a large number of entries in the TRUSTED_ROOT_CRLS store

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

The vmware-vpostgres service fails to start on vCenter Server.
Most other services, such as vmware-vpxd-svcs and vmware-vpxd, also fail to start. For more information about vCenter services, see Stopping, Starting or Restarting VMware vCenter Server Appliance 6.x & above services (broadcom.com).
Connecting to the vCenter database (VCDB) fails with the error below. For more information about VCDB, see Interacting with the vCenter Server Appliance 6.5/6.7/7.0/8.0 embedded vPostgres Database (broadcom.com).

Failed to connect to database: ODBC error: (08001) - [unixODBC]could not connect to server: Connection refused
-->     Is the server running on host "localhost" (127.0.0.1) and accepting
-->     TCP/IP connections on port 5432?

vPostgres logs are not updated with any events.

Note: vPostgres logs are located in /var/log/vmware/vpostgres/postgresql-xx.log

In /var/log/vmware/vpxd/vpxd.log, you may see entries similar to:

yyyy-mm-ddThh:mm:ss error vpxd[35339] [Originator@6876 sub=vpxdVdb] [VpxdVdb::SetDBType] Failed to connect to database: ODBC error: (08001) - [unixODBC]could not connect to server: Connection refused
-->     Is the server running on host "localhost" (127.0.0.1) and accepting
-->     TCP/IP connections on port 5432?
-->     Retry attempt: 16305 ...

/var/log/vmware/vmon/vmon-syslog.log does not indicate why vmware-vpostgres is not starting:

yyyy-mm-ddThh:mm:ss notice vmon  Received start request for vmware-vpostgres
yyyy-mm-ddThh:mm:ss notice vmon  <vmware-vpostgres-prestart> Constructed command: /opt/vmware/vpostgres/current/scripts/pg_pre_start
yyyy-mm-ddThh:mm:ss notice vmon  Executing service batch op API_HEALTH. IgnoreFail=1, service count=10
yyyy-mm-ddThh:mm:ss notice vmon  <vapi-endpoint-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonApiHealthCmd.py -n vapi-endpoint -u /vapiendpoint/health -t 30
yyyy-mm-ddThh:mm:ss notice vmon  <rhttpproxy-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-rhttpproxy/rhttpproxy-vmon-apihealth.py
yyyy-mm-ddThh:mm:ss notice vmon  <vmware-vpostgres> Skip service health check. State STOPPED, Curr request 1
yyyy-mm-ddThh:mm:ss notice vmon  <vcha> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  <vmware-postgres-archiver> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  <vpxd-svcs> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  <vpxd> Skip service health check. State STOPPING, Curr request 1
yyyy-mm-ddThh:mm:ss notice vmon  <sps> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  <rbd> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  <pschealth> Skip service health check. State STOPPED, Curr request 0
yyyy-mm-ddThh:mm:ss notice vmon  Successfully executed service batch operation API_HEALTH.

In /var/log/vmware/vmon/vmon.log, you see entries similar to the following (filtering the log with grep for vpostgres is recommended):

yyyy-mm-ddThh:mm:ss Wa(03) host-xxxx <vmware-vpostgres> Service pre-start command's stderr: Generating /storage/db/vpostgres_ssl/root_ca.pem using store TRUSTED_ROOTS
yyyy-mm-ddThh:mm:ss Wa(03) host-xxxx <vmware-vpostgres> Service pre-start command's stderr: Grabbing alias list for store TRUSTED_ROOTS, attempt 1
yyyy-mm-ddThh:mm:ss Wa(03) host-xxxx <vmware-vpostgres-prestart> SysProcess exec timed out. Force kill. Pid ####
yyyy-mm-ddThh:mm:ss Er(02) host-xxxx <vmware-vpostgres> Service pre-start command failed with exit code 1.
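The pre-start timeout is easy to miss in a busy vmon.log. The following is a minimal sketch of a filter for it, assuming the log path and message text shown above. The function names vpostgres_prestart_events and vpostgres_prestart_timed_out are illustrative only and are not part of the product.

```shell
#!/bin/bash
# Hypothetical helper: print the vmon.log lines that belong to the vPostgres
# service or its pre-start step. The default path is the one named in this
# article; pass another file as the first argument to override it.
vpostgres_prestart_events() {
    local log_file="${1:-/var/log/vmware/vmon/vmon.log}"
    # Match both the <vmware-vpostgres> and <vmware-vpostgres-prestart> tags.
    grep -E '<vmware-vpostgres(-prestart)?>' "$log_file"
}

# Hypothetical helper: succeed if the pre-start command was force-killed
# after timing out, which is the symptom described in this article.
vpostgres_prestart_timed_out() {
    local log_file="${1:-/var/log/vmware/vmon/vmon.log}"
    vpostgres_prestart_events "$log_file" | grep -q 'SysProcess exec timed out'
}

# Example use on the appliance:
# vpostgres_prestart_timed_out && echo "vPostgres pre-start timed out"
```

A timeout reported here, right after the "Grabbing alias list" message, points toward the certificate store check described under Cause.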

In /var/log/vmware/vpxd-svcs/vpxd-svcs.log, you may see the error below:

SQL Error: org.apache.commons.dbcp.SQLNestedException: Cannot create PoolableConnectionFactory (Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.)

Environment

VMware vCenter Server 6.x
VMware vCenter Server 7.x
VMware vCenter Server 8.x

Cause

This issue is caused by corrupted certificates under /etc/ssl/certs, which result in an unexpectedly high number of entries in the TRUSTED_ROOT_CRLS store.

To confirm the cause, run the command below on the VCSA. If you are using an external PSC, run it on both the vCenter Server and the PSC:

# /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOT_CRLS | grep Number

Output should look like:

Number of entries in store :    xxxx

Notes:

If the command reports a large number of entries (hundreds or thousands), proceed with the resolution in this article.
With an external Platform Services Controller, run the command on both the Platform Services Controller and the vCenter Server, as described above.
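The check above can be scripted. The following is a minimal sketch that extracts the count from the "Number of entries in store" line and compares it with a threshold. The function names are hypothetical, and the default threshold of 100 is an assumption derived from the "hundreds or thousands" guidance, not a documented limit.

```shell
#!/bin/bash
# Hypothetical helper: read vecs-cli output on standard input and print the
# number from a line such as "Number of entries in store :    1234".
crl_entry_count() {
    awk -F: '/Number of entries in store/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Hypothetical helper: succeed when the count reaches the threshold.
# ASSUMPTION: 100 is an illustrative default, not a value from this article.
crl_store_is_bloated() {
    local count="$1" threshold="${2:-100}"
    # Reject empty or non-numeric input instead of letting the test error out.
    case "$count" in
        ''|*[!0-9]*) return 2 ;;
    esac
    [ "$count" -ge "$threshold" ]
}

# Example use on the appliance (command taken from this article):
# count=$(/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOT_CRLS | crl_entry_count)
# crl_store_is_bloated "$count" && echo "TRUSTED_ROOT_CRLS holds $count entries"
```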

Resolution

To resolve this issue, remove the extra entries from the TRUSTED_ROOT_CRLS store by following the steps below:

Take an offline snapshot of the VCSA virtual machine (and of the Platform Services Controller virtual machine if you use an external PSC).

Caution: Do NOT skip this step.

Connect to the VCSA (and the external PSC, if you are using one) through SSH.
Download the "crl-fix.sh" script attached to this article and upload it to the /tmp directory of the impacted VCSA (or external Platform Services Controller) using WinSCP or FileZilla. Alternatively, copy its contents into a text file on the appliance using the vi editor.

Note: If you get one of the errors below while connecting to the appliance via WinSCP, run the following command. For more information, see Connecting to vCenter Server Virtual Appliance using WinSCP fails with the error: Received too large (1433299822 B) SFTP packet. Max supported packet size is 1024000 B (broadcom.com).

Host is not communicating for more than 15 seconds. If the problem repeats, try turning off 'Optimize connection buffer size'.
or
Cannot initialize SFTP protocol. Is the host running an SFTP server?

# chsh -s /bin/bash root

Browse to the /tmp directory.

# cd /tmp

Run the command below to make the file executable.

# chmod +x crl-fix.sh

Run the crl-fix.sh script.

# ./crl-fix.sh

Note: If you get the error below while running the script:

bash: ./crl-fix.sh: /bin/bash^M: bad interpreter: No such file or directory

This error is caused by DOS carriage returns added to the script when copying from a Windows-based text editor. To resolve this problem, run this command and rerun the script:

# sed -i -e 's/\r$//' crl-fix.sh
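A slightly more defensive variant checks for carriage returns first and only rewrites the file when needed. This is a minimal sketch, assuming GNU sed as shipped on the appliance. The function names has_carriage_returns and strip_carriage_returns are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the file contains at least one carriage return.
has_carriage_returns() {
    # tr deletes every byte except CR; a non-empty result means CRs are present.
    [ -n "$(tr -cd '\r' < "$1" | head -c 1)" ]
}

# Hypothetical helper: remove a trailing CR from each line, only when needed.
# ASSUMPTION: GNU sed, which understands \r and in-place editing with -i.
strip_carriage_returns() {
    if has_carriage_returns "$1"; then
        sed -i -e 's/\r$//' "$1"
    fi
}

# Example use on the appliance:
# strip_carriage_returns /tmp/crl-fix.sh
```

Checking first leaves a clean script untouched, so the command is safe to run again.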

Notes:

The script may take some time before showing any progress, depending on the number of entries in the TRUSTED_ROOT_CRLS store.
When the script completes, it should stop the vmafdd service and start it again. This can take more than 10 minutes if the number of entries is high.

Restart services of the VCSA and/or the external PSC

# service-control --stop --all
# service-control --start --all
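After the restart, it can help to confirm that vmware-vpostgres actually came up. The following is a minimal sketch that parses service status output. It assumes that "service-control --status --all" prints section headers such as "Running:" and "Stopped:", each followed by space-separated service names. That output layout and the function name service_is_running are assumptions, not taken from this article.

```shell
#!/bin/bash
# Hypothetical helper: read service status output on standard input and
# succeed if the named service is listed under the "Running:" header.
# ASSUMPTION: headers end with ":" and service names follow on later lines.
service_is_running() {
    awk -v svc="$1" '
        /:[ \t]*$/ { section = $1; next }      # remember the current header
        section == "Running:" {
            for (i = 1; i <= NF; i++) if ($i == svc) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example use on the appliance:
# service-control --status --all | service_is_running vmware-vpostgres \
#     && echo "vmware-vpostgres is running"
```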

Attachments

crl-fix