This article describes the suggested steps to determine the root cause of Postgres cluster issues, which are typically caused by network events and/or ungraceful reboots.
The article Troubleshooting VMware Identity Manager postgres cluster deployed through vRealize Suite Lifecycle Manager provides the suggested steps to recover the Postgres cluster. To reboot a Workspace ONE Access (VMware Identity Manager) cluster, use the Power OFF and Power ON operations from vRealize Suite Lifecycle Manager; alternatively, follow the article Graceful Shutdown and Power On of a VMware Identity Manager PostgreSQL cluster.
As a best practice, validate that VMware Aria Suite Lifecycle (formerly known as vRealize Suite Lifecycle Manager, vRSLCM) has the current root and sshuser passwords of the Workspace ONE Access nodes stored in Locker. VMware Aria Suite Lifecycle runs scheduled health checks and can automatically recover from some replication delays in the Postgres cluster, which helps improve the availability of the Aria Suite products.
The vIDM Postgres health is reported as unhealthy, as indicated in the article Troubleshooting VMware Identity Manager postgres cluster deployed through vRealize Suite Lifecycle Manager.
To diagnose the issue, collect the following information from your environment:
- The output of journalctl -u systemd-networkd
- The output of journalctl --list-boots
- The pgService log (/var/log/pgService/pgService.log). To identify the primary node, refer to KB 367175.
- The vRealize Suite Lifecycle Manager logs (/var/log/vrlcm/vmware_vrlcm.log and the rotated /var/log/vrlcm/vmware_vrlcm-##.log) to find the history of the cluster status. You can search for relevant events with a command like grep -A 5 -r "replication_delay". The output provides key details, such as whether each node is up or down:
 node_id | hostname | port | status | lb_weight | role    | ... | replication_delay | last_status_change
---------+---------+------+--------+-----------+---------+-----+-------------------+---------------------
 0       | Host1   | 5432 |   up   | 0.333333  | master  | ... |        0          | 2019-10-14 06:05:42
 1       | Host2   | 5432 |   up   | 0.333333  | standby | ... |        0          | 2019-10-14 06:05:42
 2       | Host3   | 5432 |   up   | 0.333333  | standby | ... |        0          | 2019-10-14 06:05:42
(3 rows)
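As an illustration, that search can be run across both the current and rotated vRSLCM logs (the file paths are the ones listed above; the "down" filter pattern is only an assumption about how a failed node appears in the table):

```shell
# Search the current and rotated vRSLCM logs for cluster status snapshots;
# -A 5 prints the five lines after each match so the whole table is visible.
grep -A 5 "replication_delay" /var/log/vrlcm/vmware_vrlcm*.log

# Narrow down to snapshots in which a node was reported down.
grep -B 2 -A 5 -E "\| *down" /var/log/vrlcm/vmware_vrlcm*.log
```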
Check the systemd journal for network events, such as a lost carrier:
$ journalctl -u systemd-networkd
-- Logs begin at Fri 2022-03-04 05:35:39 UTC, end at Wed 2022-04-20 09:13:31 UTC. --
...
Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Lost carrier
Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Gained carrier
Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Configured
Review /var/log/pgService/pgService.log around the same time. Look for fatal errors indicating a loss of network that would cause the watchdog service to shut down.
2022-05-27 05:57:32: pid 2488: WARNING: network IP is removed and system has no IP is assigned 
2022-05-27 05:57:32: pid 2488: DETAIL: changing the state to in network trouble 
2022-05-27 05:57:32: pid 2488: LOG: watchdog node state changed from [MASTER] to [IN NETWORK TROUBLE] 
2022-05-27 05:57:32: pid 2488: FATAL: system has lost the network 
2022-05-27 05:57:32: pid 2488: LOG: Watchdog is shutting down 
2022-05-27 05:57:32: pid 14653: LOG: watchdog: de-escalation started
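To surface such events quickly, the pgService log can be searched for the message patterns shown above (a sketch; the exact message text may differ between versions):

```shell
# Look for network-loss and watchdog shutdown events in the pgService log.
grep -E "FATAL|NETWORK TROUBLE|Watchdog is shutting down" /var/log/pgService/pgService.log
```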
Compare the output of journalctl --list-boots across the nodes. In a healthy cluster, the nodes should have similar reboot times if a graceful restart of the environment was performed.
Note: VMware Aria Suite Lifecycle's health notifications changed in version 8.8.2.
VMware Aria Suite Lifecycle continuously monitors the health of VMware Identity Manager (vIDM) by performing a series of scheduled checks. This document outlines the key steps in this process and shows examples from the logs for both successful and failed checks.
The health check process includes the following steps:
First, SSH connectivity to each vIDM node is verified by running the ls command as the root user.
2022-10-26 21:14:31.665 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
2022-10-26 21:14:31.667 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: ls
2022-10-26 21:14:31.713 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> ls
2022-10-26 21:14:34.961 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
2022-10-26 21:14:34.961 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2
...
2022-10-26 21:15:05.059 INFO  [...] c.v.v.l.v.c.u.VidmPgpoolUtil -  -- delegateIP is assigned to #.#.#.102
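The delegate IP assignment can be reproduced manually on each node with the same pipeline the health check uses (this assumes the classic net-tools ifconfig output with an inet addr: field, as in the log above):

```shell
# The node that currently owns the pgpool delegate (virtual) IP binds it on
# eth0:0; this prints that IP on the owning node and nothing on the others.
ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2
```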
The status of the pgService is checked on each node.
2022-10-26 21:15:05.059 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: /etc/init.d/pgService status
2022-10-26 21:15:05.110 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> /etc/init.d/pgService status
...
2022-10-26 21:15:06.158 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- Getting pgpool service status pgpool service is running
The pgpool password is retrieved to perform the subsequent database checks.
2022-10-26 21:15:08.359 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: cat /usr/local/etc/pgpool.pwd
2022-10-26 21:15:08.408 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> cat /usr/local/etc/pgpool.pwd
The watchdog status is queried to identify which node is the MASTER in the watchdog configuration.
2022-10-26 21:15:38.464 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
2022-10-26 21:15:38.464 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: su root -c "echo -e MXMXMXMX|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"
...
vIDMNode02.lab.local:9999 Linux vIDMNode02.lab.local vIDMNode02.lab.local 9999 9000 7 STANDBY
vIDMNode01.lab.local:9999 Linux vIDMNode01.lab.local #.#.#.101 9999 9000 4 MASTER
vIDMNode03.lab.local:9999 Linux vIDMNode03.lab.local #.#.#.103 9999 9000 7 STANDBY
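The per-node lines can be reduced to hostname and watchdog state for a quick comparison (a sketch; the field positions are taken from the sample output above, where the last two fields are the state code and state name):

```shell
# Print hostname, watchdog state code, and state name from each
# pcp_watchdog_info node line (e.g. "4 MASTER" or "7 STANDBY").
watchdog_states() {
  awk 'NF >= 2 {print $1, $(NF-1), $NF}'
}
```

Any node not reporting MASTER or STANDBY here warrants a closer look at its pgService log.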
Finally, the pool status is queried, confirming that all nodes are up, identifying the primary node, and checking for replication delay.
2022-10-26 21:16:09.662 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND ::  su postgres -c "echo -e MXMXMXMX|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""
...
 node_id |  hostname  | port | status | lb_weight |  role   | ... | replication_delay 
---------+------------+------+--------+-----------+---------+-----+-------------------
 0       | #.#.#.101 | 5432 | up     | 0.333333  | standby | ... | 0                 
 1       | #.#.#.102 | 5432 | up     | 0.333333  | primary | ... | 0                 
 2       | #.#.#.103 | 5432 | up     | 0.333333  | standby | ... | 0                 
(3 rows)
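To flag trouble in that table programmatically, the status column can be checked on every data row (a sketch based on the column layout shown above; the column positions are an assumption):

```shell
# Print any pool_nodes row whose status column (the 4th |-separated field)
# is not "up"; the header and separator rows are filtered out.
pool_nodes_not_up() {
  awk -F'|' 'NF >= 4 { s = $4; gsub(/ /, "", s); if (s != "up" && s != "status") print }'
}
```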
2022-10-26 21:17:44.418 INFO  [...] VidmClusterHealthNotificationTask -  -- vIDM Cluster health status is GREEN !!!
2022-10-26 21:17:44.418 INFO  [...] VidmClusterHealthNotificationTask -  -- Skipping update of existing cluster health notification status level
Aria Suite Lifecycle periodically runs this task to auto-recover issues like replication delays caused by network glitches.
If Aria Suite Lifecycle is unable to SSH to the VMware Identity Manager nodes, the health check will fail and report the cluster as unhealthy.
Here is an example of the log output when an SSH connection fails:
2023-03-05 03:59:53.869 ERROR [...] c.v.v.l.u.SessionHolder -  -- SessionHolder.newSession Exception encountered
com.jcraft.jsch.JSchException: Auth fail
        at com.jcraft.jsch.Session.connect(Session.java:519)
        at com.vmware.vrealize.lcm.util.SessionHolder.newSession(SessionHolder.java:53)
        at com.vmware.vrealize.lcm.util.SshUtils.executeWithKnownTimeout(SshUtils.java:620)
        at com.vmware.vrealize.lcm.vidm.driver.helpers.VidmUtil.runVidmSshCommand(VidmUtil.java:185)
        [...]
2023-03-05 03:59:53.923 ERROR [...] c.v.v.l.u.SshUtils -  -- Exception cause : com.jcraft.jsch.JSchException: Auth fail
2023-03-05 03:59:53.929 ERROR [...] c.v.v.l.v.c.u.VidmPgpoolUtil -  -- Exception while validating SSH root credentials of the vIDM host - #.#.#.102
com.vmware.vrealize.lcm.util.exception.SshAuthenticationFailureException: Cannot execute ssh commands on the host - #.#.#.102, validate the SSH login credentials.
        at com.vmware.vrealize.lcm.vidm.driver.helpers.VidmUtil.runVidmSshCommand(VidmUtil.java:193)
        at com.vmware.vrealize.lcm.vidm.clustering.util.VidmPgpoolUtil.checkClusterStatus(VidmPgpoolUtil.java:829)
        [...]
[... The above authentication failure repeats for hosts #.#.#.103 and #.#.#.101 ...]
2023-03-05 04:00:05.875 INFO  [...] VidmClusterHealthNotificationTask -  -- Final Status is not empty : ---->
Unable to create SSH connection to the VMware Identity Manager host(s): [#.#.#.102, #.#.#.103, #.#.#.101] as root user from vRSLCM. Ensure vRSLCM inventory has the right root passwords for Identity Manager nodes and they are not expired. Trigger inventory sync of globalenvironment in vRSLCM to update the current root password.
2023-03-05 04:00:05.876 INFO  [...] VidmClusterHealthNotificationTask -  -- vIDM Cluster health status is RED !!!
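When this failure is suspected, the vRSLCM log itself can be checked for the authentication errors before resetting any passwords (a sketch; the patterns come from the stack traces above):

```shell
# Count SSH authentication failures recorded by vRSLCM; repeated "Auth fail"
# entries point at stale or expired root passwords rather than a network issue.
grep -cE "JSchException: Auth fail|SshAuthenticationFailureException" /var/log/vrlcm/vmware_vrlcm.log
```

If failures are present, update the root password in Locker and trigger the globalenvironment inventory sync as described in the notification text above.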