The vmware-vcha and vmware-vpostgres services do not start on the passive node of a vCenter HA cluster.

Article ID: 414943

Products

VMware vCenter Server

Issue/Introduction

-- In the vSphere Client UI, the Passive node is reported as up, but the vmware-vcha and vmware-vpostgres services on it are stopped.
The UI displays the following message, and the cluster state is "Degraded".
----
PostgreSQL replication is not in progress. Verify if PostgreSQL server is running on the Passive node and that the Passive node is reachable on the vCenter HA network
----

-- VCHA logs on the Active Node and Passive Node show output similar to the following:

- Active Node -
----
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[24295] [Originator@6876 sub=ClusterMgr opID=WorkQueue-11d6c4ab] Slave id is : XXX.XXX.XXX.XXX
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[24295] [Originator@6876 sub=ClusterMgr opID=WorkQueue-11d6c4ab] Slave id is : XXX.XXX.XXX.XXX
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[24295] [Originator@6876 sub=ClusterMgr opID=WorkQueue-11d6c4ab] MASTER XXX.XXX.XXX.XXX
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[24295] [Originator@6876 sub=ClusterMgr opID=WorkQueue-11d6c4ab] Quorum: YES
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] Setting Key = /pcluster/livenodes Value = 2
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] New version 8589934798 {2, 206}
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] SetKvStoreInt version: 8589934798 isUpdate: true
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] compressed from size 2457 to size 519 (max 2470)
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] name kvstore version (8589934798 ?> 8589934797) force true
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] Sent proposal to XXX.XXX.XXX.XXX (version 8589934798)
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24295] [Originator@6876 sub=Cluster opID=WorkQueue-11d6c4ab] Sent proposal to XXX.XXX.XXX.XXX (version 8589934798)
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[24287] [Originator@6876 sub=Cluster opID=WorkQueue-75191a64] Received ack=true from XXX.XXX.XXX.XXX for kvstore (version 8589934798)
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[24288] [Originator@6876 sub=Message opID=WorkQueue-11d6c4ab] WriteComplete: Error N7Vmacore16TimeoutExceptionE(Operation timed out: Stream: SSL(<io_obj p:0x00007f8290001b80, h:-1, <TCP 'XXX.XXX.XXX.XXX : XXXX'>, <TCP 'XXX.XXX.XXX.XXX : XXXXX'>>), duration: XX:XX:XX.XXXXX (hh:mm:ss.us))
--> [context]zKq7AVECAQAAAEuUVQEZdmNoYQAAxbVTbGlidm1hY29yZS5zbwAAUglDAIwxRACaSEsARzg3ABQ5NwCaYDcBxc8UdmNoYQABBWkUAQ66EAFDJRIBhb0QAbrpDQFp+Q0BFFoSASBcEgHObR0B6HodAdDCGgGPwxoA5ss3APkkOACTwFECro4AbGlicHRocmVhZC5zby4wAAMv3g9saWJjLnNvLjYA[/context] - pending writes dropped
----

- Passive Node -
----
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[12526] [Originator@6876 sub=VchaUtil] Executing system command; /opt/vmware/vpostgres/current/bin/psql, args: [--dbname=host=XXX.XXX.XXX.XXX port=5432 user=replicator password=xxxxxxxxxxxxxxxx dbname=postgres application_name=vcha sslmode=verify-ca sslrootcert=/storage/db/vpostgres_ssl/root_ca.pem replication=1,--command=IDENTIFY_SYSTEM,--no-password]
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[12526] [Originator@6876 sub=vpxUtil] System command failed; '/opt/vmware/vpostgres/current/bin/psql', args: [--dbname=host=XXX.XXX.XXX.XXX port=5432 user=replicator password=xxxxxxxxxxxxxxxx dbname=postgres application_name=vcha sslmode=verify-ca sslrootcert=/storage/db/vpostgres_ssl/root_ca.pem replication=1,--command=IDENTIFY_SYSTEM,--no-password], exit code: 2
--> stdout:
--> stderr: psql.bin: error: connection to server at "XXX.XXX.XXX.XXX", port 5432 failed: SSL error: certificate verify failed
-->
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[12522] [Originator@6876 sub=Election opID=clusterElection.cpp:1570-3b0d1f35] CheckVersion: Version[3] Other host GT : 8589934791 > 8589934789
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[12522] [Originator@6876 sub=Election opID=clusterElection.cpp:1570-3b0d1f35] CheckVersion: Pending version change 8589934791 >= 8589934791
XXXX-XX-XXTXX:XX:XX.XXX+09:00 verbose vcha[12526] [Originator@6876 sub=VchaUtil] Executing system command; /opt/vmware/vpostgres/current/bin/psql, args: [--dbname=host=XXX.XXX.XXX.XXX port=5432 user=replicator password=xxxxxxxxxxxxxxxx dbname=postgres application_name=vcha sslmode=verify-ca sslrootcert=/storage/db/vpostgres_ssl/root_ca.pem replication=1,--command=IDENTIFY_SYSTEM,--no-password]
XXXX-XX-XXTXX:XX:XX.XXX+09:00 info vcha[12526] [Originator@6876 sub=vpxUtil] System command failed; '/opt/vmware/vpostgres/current/bin/psql', args: [--dbname=host=XXX.XXX.XXX.XXX port=5432 user=replicator password=xxxxxxxxxxxxxxxx dbname=postgres application_name=vcha sslmode=verify-ca sslrootcert=/storage/db/vpostgres_ssl/root_ca.pem replication=1,--command=IDENTIFY_SYSTEM,--no-password], exit code: 2
--> stdout:
--> stderr: psql.bin: error: connection to server at "XXX.XXX.XXX.XXX", port 5432 failed: SSL error: certificate verify failed
-->
----
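
To check quickly whether a vcha log contains the failure signature shown above, a minimal sketch such as the following may help. The function name count_cert_verify_failures is hypothetical, and the log path in the usage comment is an assumption about the default location, not something stated in this article; adjust it for your environment.

```shell
#!/bin/sh
# Count the lines in a vcha log that report a failed PostgreSQL SSL
# certificate verification (the stderr line shown in the Passive node log).
# Usage (path is an assumption):
#   count_cert_verify_failures /var/log/vmware/vcha/vcha.log
count_cert_verify_failures() {
    log_file="$1"
    # grep -c prints the number of matching lines (0 if none). It exits
    # non-zero when nothing matches, so "|| true" keeps the status clean.
    grep -c "SSL error: certificate verify failed" "$log_file" || true
}
```

A non-zero count on the Passive node suggests the symptom described in this article, but it does not by itself distinguish an expired certificate from a mismatched one.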

Environment

VMware vCenter Server 7.x
VMware vCenter Server 8.x

Cause

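This issue occurs when the PostgreSQL SSL certificate on the Active and Passive nodes has expired. The replication connection shown in the Passive node log uses sslmode=verify-ca, so verification of the expired certificate fails and PostgreSQL replication cannot start.

The certificate file is:

- /storage/db/vpostgres_ssl/server.crt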

You can check the certificate expiration date with the following command:

# openssl x509 -in /storage/db/vpostgres_ssl/server.crt -text -noout | grep -ie "Not Before" -ie "Not After"

----
Example output:
# openssl x509 -in /storage/db/vpostgres_ssl/server.crt -text -noout | grep -ie "Not Before" -ie "Not After"
            Not Before: Jul 20 20:07:48 2023 GMT
            Not After : Jul 20 08:07:48 2025 GMT
----
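
As an alternative to reading the dates by eye, the check can be sketched with the -checkend option of openssl x509, which tests whether a certificate expires within a given number of seconds. The function name cert_status is hypothetical and is not part of any VMware tooling.

```shell
#!/bin/sh
# Report whether a PEM certificate is still valid at the current time.
# Usage: cert_status /storage/db/vpostgres_ssl/server.crt
cert_status() {
    cert_file="$1"
    # "-checkend 0" exits 0 when the certificate has not yet expired and
    # non-zero when it has expired. A missing or unreadable file also
    # produces a non-zero exit, so both cases share one message.
    if openssl x509 -in "$cert_file" -noout -checkend 0 >/dev/null 2>&1; then
        echo "valid"
    else
        echo "expired or unreadable"
    fi
}
```

Run it on both the Active and Passive nodes, since the article attributes the issue to the certificate on both.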

Resolution

Unconfigure VCHA, update the SSL certificate, and then configure VCHA again to synchronize the certificate.

  1. Power off and delete the Passive and Witness node VMs.
  2. Log in to the Active node using SSH or the VM console.
  3. To enable the Bash shell, type shell at the appliance shell prompt.
  4. Run the following command to remove the vCenter HA configuration:
      # vcha-destroy -f
  5. Reboot the Active node.
    The Active node should now be a standalone vCenter Server Appliance.
  6. Take a snapshot of the vCenter Server Appliance as a backup.
  7. Update the MACHINE_SSL certificate using the vCert tool attached to the following KB article:
      vCert - Scripted vCenter Expired Certificate Replacement
      https://knowledge.broadcom.com/external/article/385107/vcert-scripted-vcenter-expired-certific.html
  8. Reconfigure the VCHA cluster.
      After the VCHA configuration completes, run the following command to confirm that the expiration date (Not After) of server.crt has been updated, and then confirm that the error is resolved.
      # openssl x509 -in /storage/db/vpostgres_ssl/server.crt -text -noout | grep -ie "Not Before" -ie "Not After"
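
For the final verification, the remaining lifetime of the renewed certificate can also be expressed in days. The following is a minimal sketch, assuming GNU date (for the -d option) is available on the appliance; the function names days_until and cert_days_left are hypothetical.

```shell
#!/bin/sh
# Print the whole days between a reference time and an openssl-style date
# such as "Jul 20 08:07:48 2025 GMT". Assumes GNU date.
# Usage: days_until "<date string>" [reference-epoch-seconds]
days_until() {
    end_epoch=$(date -u -d "$1" +%s) || return 1
    # Default the reference time to "now" when no second argument is given.
    now_epoch="${2:-$(date -u +%s)}"
    # Integer division truncates toward zero, so a certificate that expired
    # less than one day ago reports 0 rather than -1.
    echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Print the days remaining before a certificate's Not After date.
# Usage: cert_days_left /storage/db/vpostgres_ssl/server.crt
cert_days_left() {
    # "-enddate" prints "notAfter=<date>"; keep only the part after "=".
    not_after=$(openssl x509 -in "$1" -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$not_after" ] || return 1
    days_until "$not_after"
}
```

A positive value on both nodes after VCHA is reconfigured indicates that the renewed certificate was synchronized; a zero or negative value means server.crt is still expired or about to expire.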

Additional Information

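- vCert - Scripted vCenter Expired Certificate Replacement: https://knowledge.broadcom.com/external/article/385107/vcert-scripted-vcenter-expired-certific.html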

- The same symptom may occur even when the certificate has not expired. In that case, the machine SSL certificate and the PostgreSQL certificate may be mismatched. Refer to the following KB article:
vCenter HA configuration is failing with error message "PostgreSQL replication is not in progress. Verify if PostgreSQL server is running on the Passive node and that the Passive node is reachable on the vCenter HA network"

- Japanese KB: https://knowledge.broadcom.com/external/article/414953