VMware vPostgres service fails to start in Passive node when vCenter Server Appliance 6.5 vCenter HA is deployed

Products

VMware vCenter Server

Issue/Introduction

Symptoms:

In the vCenter HA configure tab the Passive node is Down and reports the message:

A replication failure might be occurring at the moment. Automatic failover protection is disabled.
If vCenter HA was recently enabled, initial replication might still be in progress and could take a few minutes.

The vCenter HA monitor tab reports:

Appliance configuration is out of sync.
Appliance state is out of sync.
Appliance sqlite db is out of sync.

Check the services status in Passive node with the commnd line "service-control --status" reporting the below two services are in Stopped status:

vmware-vcha
vmware-vpostgres

Only the below two services are in Running status:
vmware-statsmonitor
vmware-vmon

In the Passive node /var/log/vmware/vcha/vcha-*.log, you see similar error:

2018-07-11T08:58:24.874+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error executing command /bin/su: exit status=[4], stdout=[], stderr=[You are required to change your password immediately (password aged)
--> su: Authentication token is no longer valid; new one required
--> (Ignored)
--> pg_ctl: directory "/storage/db/vpostgres" is not a database cluster directory
--> ]
....
2018-07-11T16:01:04.920+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error executing command /usr/bin/python: exit status=[1], stdout=[logs available at: /var/log/vmware/vcha --> ], stderr=[]
2018-07-11T16:01:04.920+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error while running python script /usr/lib/vmware-vcha/scripts/postgres_passive.py
2018-07-11T16:01:04.941+08:00 info vcha[7F3896EFC700] [Originator@6876 sub=Agent] Fatal event in the HA Agent, shutting down
2018-07-11T16:01:04.941+08:00 info vcha[7F3896EFC700] [Originator@6876 sub=Agent] Shutting down HA Agent

In the /var/log/vmware/vcha/repl_passive_setup.log, you see similar error:

2018-07-11T07:59:53.150Z INFO repl_passive_setup running command ['/bin/su', '-s', '/bin/bash', '-', 'vpostgres', '-c', "/opt/vmware/vpostgres/current/bin/pg_basebackup --pgdata=/storage/db/vpostgres --xlogdir=/storage/dblog/vpostgres/pg_xlog --no-password --progress --dbname='host=10.200.20.1 port=5432 user=replicator password=<redacted> sslmode=require' --verbose --xlog-method=stream"]
2018-07-11T08:01:03.853Z INFO repl_passive_setup rc = [1], stdout = [], stderr = [You are required to change your password immediately (password aged)
....
pg_basebackup: could not get transaction log end position from server: ERROR: could not open file "./pg_hba.conf.bak": Permission denied
2018-07-11T08:01:03.853Z ERROR repl_passive_setup Failed full_sync: attempt: 100

Run the below command on both Active and Passive node to verify the password of the account "vpostgres" was expired:

[ ~ ]# chage --list vpostgres
Last password change : Mar 14, 2018
Password expires : Jun 13, 2018
Password inactive : never
Account expires : never
Minimum number of days between password change : 1
Maximum number of days between password change : 90
Number of days of warning before password expires : 7

In the Active node under /storage/db/vpostgres there is file named "pg_hba.conf.bak" and verify that the vpostgres account doesn't have permission to access it.

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Cause

This issue occurs if the vpostgres user password expired. This does not impact normal use of vCenter, however the vCenter HA gets impacted due to operations on additional files. In this example pg_hba.conf.bak.

Resolution

To resolve this issue, change the password of vpostgres user to never expire on all three vCenter HA nodes (active, passive and witness):
1. Log in to each vCenter HA node as root using SSH or Virtual Machine Console:
b. Change to the BASH shell by running the shell command:

Command> shell

c. Set the vpostgres user account to never expire by running this command:

[ ~ ]# chage -M -1 vpostgres

d. Confirm that the vpostgres user account is set to never expire by running this command:

[ ~ ]# chage --list vpostgres
Last password change : Jul 17, 2018
Password expires : never
Password inactive : never
Account expires : never
Minimum number of days between password change : 1
Maximum number of days between password change : -1
Number of days of warning before password expires : 7

Note: The preceding output excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

2. Remove the file pg_hba.conf.bak under /storage/db/vpostgres/ in both Active and Passive node.
3. Restart the Passive node.

If the Passive node still could not return to UP status after these steps, you need to re-deploy vCenter HA