Symptoms:
- On the vCenter HA Configure tab, the Passive node is shown as Down and the following message is reported:
A replication failure might be occurring at the moment. Automatic failover protection is disabled.
If vCenter HA was recently enabled, initial replication might still be in progress and could take a few minutes.
- The vCenter HA monitor tab reports:
Appliance configuration is out of sync.
Appliance state is out of sync.
Appliance sqlite db is out of sync.
- On the Passive node, running the command "service-control --status" shows that the following two services are in Stopped status:
vmware-vcha
vmware-vpostgres
Only the following two services are in Running status:
vmware-statsmonitor
vmware-vmon
- On the Passive node, in /var/log/vmware/vcha/vcha-*.log, you see errors similar to:
2018-07-11T08:58:24.874+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error executing command /bin/su: exit status=[4], stdout=[], stderr=[You are required to change your password immediately (password aged)
--> su: Authentication token is no longer valid; new one required
--> (Ignored)
--> pg_ctl: directory "/storage/db/vpostgres" is not a database cluster directory
--> ]
....
2018-07-11T16:01:04.920+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error executing command /usr/bin/python: exit status=[1], stdout=[logs available at: /var/log/vmware/vcha --> ], stderr=[]
2018-07-11T16:01:04.920+08:00 error vcha[7F3896EFC700] [Originator@6876 sub=VchaUtil] Error while running python script /usr/lib/vmware-vcha/scripts/postgres_passive.py
2018-07-11T16:01:04.941+08:00 info vcha[7F3896EFC700] [Originator@6876 sub=Agent] Fatal event in the HA Agent, shutting down
2018-07-11T16:01:04.941+08:00 info vcha[7F3896EFC700] [Originator@6876 sub=Agent] Shutting down HA Agent
- In /var/log/vmware/vcha/repl_passive_setup.log, you see errors similar to:
2018-07-11T07:59:53.150Z INFO repl_passive_setup running command ['/bin/su', '-s', '/bin/bash', '-', 'vpostgres', '-c', "/opt/vmware/vpostgres/current/bin/pg_basebackup --pgdata=/storage/db/vpostgres --xlogdir=/storage/dblog/vpostgres/pg_xlog --no-password --progress --dbname='host=10.200.20.1 port=5432 user=replicator password=<redacted> sslmode=require' --verbose --xlog-method=stream"]
2018-07-11T08:01:03.853Z INFO repl_passive_setup rc = [1], stdout = [], stderr = [You are required to change your password immediately (password aged)
....
pg_basebackup: could not get transaction log end position from server: ERROR: could not open file "./pg_hba.conf.bak": Permission denied
2018-07-11T08:01:03.853Z ERROR repl_passive_setup Failed full_sync: attempt: 100
- Running the following command on both the Active and Passive nodes shows that the password of the "vpostgres" account has expired:
[ ~ ]# chage --list vpostgres
Last password change : Mar 14, 2018
Password expires : Jun 13, 2018
Password inactive : never
Account expires : never
Minimum number of days between password change : 1
Maximum number of days between password change : 90
Number of days of warning before password expires : 7
- On the Active node, a file named "pg_hba.conf.bak" exists under /storage/db/vpostgres, and the vpostgres account does not have permission to access it.
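One way to inspect the ownership and mode of the "pg_hba.conf.bak" file is sketched below. The helper names describe_perms and owned_by are hypothetical and are not part of the appliance; the sketch assumes GNU stat is available. Ownership is only one indicator: a file owned by another account can still be readable if its mode allows it, so review the printed mode as well.

```shell
#!/bin/bash
# Hypothetical helpers (not shipped with the appliance); assumes GNU stat.

# Print owner, group, and octal mode of a file, for example "root:root 600".
describe_perms() {
    stat -c '%U:%G %a' "$1"
}

# Return 0 if the file ($1) is owned by the given user ($2), 1 otherwise.
owned_by() {
    [ "$(stat -c '%U' "$1")" = "$2" ]
}

# Example on the Active node (run as root):
#   describe_perms /storage/db/vpostgres/pg_hba.conf.bak
#   owned_by /storage/db/vpostgres/pg_hba.conf.bak vpostgres \
#       || echo "pg_hba.conf.bak is not owned by vpostgres"
```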
Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.
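The password expiry check can also be scripted. The following is a minimal sketch, assuming bash, GNU date, and English-language "chage --list" output in the format shown above. The function name is_expired is hypothetical and is not part of the appliance.

```shell
#!/bin/bash
# Hypothetical helper; assumes GNU date and English `chage --list` output.

# Reads `chage --list <account>` output on stdin.
# Returns 0 if the "Password expires" date is in the past,
# 1 otherwise (including "never" or a missing/unparseable date).
is_expired() {
    local expires exp_epoch now_epoch
    expires=$(sed -n 's/^[[:space:]]*Password expires[[:space:]]*:[[:space:]]*//p')
    [ -z "$expires" ] && return 1
    [ "$expires" = "never" ] && return 1
    exp_epoch=$(date -d "$expires" +%s 2>/dev/null) || return 1
    now_epoch=$(date +%s)
    [ "$exp_epoch" -lt "$now_epoch" ]
}

# Example on each node (run as root):
#   chage --list vpostgres | is_expired && echo "vpostgres password has expired"
```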