Analyzing root causes of Postgres Cluster issues on Workspace ONE Access (vIDM, VMware Identity Manager)

Article ID: 322709

Updated On:

Products

VMware Aria Suite

Issue/Introduction

This article describes the suggested steps to determine the root causes of Postgres cluster issues. These issues are typically caused by network events and/or ungraceful reboots.

The article Troubleshooting VMware Identity Manager postgres cluster deployed through vRealize Suite Lifecycle Manager provides the suggested steps to recover from Postgres cluster issues. To reboot a Workspace ONE Access (VMware Identity Manager) cluster, it is suggested to use the Power OFF and Power ON actions from vRealize Suite Lifecycle Manager; alternatively, follow the article Graceful Shutdown and Power On of a VMware Identity Manager PostgreSQL cluster.

As a best practice, validate that VMware Aria Suite Lifecycle (formerly known as vRealize Suite Lifecycle Manager, vRSLCM) has the updated root and sshuser passwords of the Workspace ONE Access nodes stored in the Locker. VMware Aria Suite Lifecycle runs scheduled health checks and can automatically recover from some replication delays in the Postgres cluster, which helps improve the availability of the Aria Suite products.

Symptoms:

vIDM Postgres health is reported as unhealthy, as described in the article Troubleshooting VMware Identity Manager postgres cluster deployed through vRealize Suite Lifecycle Manager.

Environment

VMware Identity Manager 3.3.x

Resolution

Workaround

Prerequisites

To diagnose the issue, collect the following information from your environment (a scripted collection sketch follows this list):

  1. From each VMware Identity Manager (vIDM) node, run the following commands to gather network logs and boot history:
    journalctl -u systemd-networkd
    journalctl --list-boots
  2. From the primary PostgreSQL node, collect the pgService log. To identify the primary node, refer to KB 367175.
    /var/log/pgService/pgService.log
  3. From VMware Aria Suite Lifecycle (vRSLCM), collect a log bundle, specifically the engine logs. These logs contain historical data about the cluster's status.
    Note: If the vIDM root or SSH user passwords have been updated in the Locker, this historical data may no longer be accessible by Suite Lifecycle.
    /var/log/vrlcm/vmware_vrlcm.log
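
Collection of these files can be scripted from a workstation with SSH access to the nodes. The following is a minimal sketch, assuming root SSH access and illustrative node names (vidmnode1.lab.com, vidmnode2.lab.com, vidmnode3.lab.com); adjust the hostnames and output directory for your environment.

    # Illustrative node names; replace with the FQDNs of your vIDM nodes.
    NODES="vidmnode1.lab.com vidmnode2.lab.com vidmnode3.lab.com"
    OUTDIR=./vidm-diagnostics
    mkdir -p "$OUTDIR"

    for node in $NODES; do
        # Network daemon log and boot history from every node
        ssh root@"$node" "journalctl -u systemd-networkd" > "$OUTDIR/$node-networkd.log"
        ssh root@"$node" "journalctl --list-boots" > "$OUTDIR/$node-boots.txt"
        # pgService log (only meaningful on the primary PostgreSQL node, see KB 367175)
        scp root@"$node":/var/log/pgService/pgService.log "$OUTDIR/$node-pgService.log" 2>/dev/null
    done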

Analysis

  1. Review the Suite Lifecycle engine logs (/var/log/vrlcm/vmware_vrlcm-##.log) to find the history of the cluster status. You can search for relevant events with a command like grep -A 5 -r "replication_delay". The output provides key details:
    • status: Shows if a node was up or down.
    • replication_delay: Indicates if there was a replication delay between nodes.
    • last_status_change: Shows the timestamp of the last status change, which is useful for correlating events.
     node_id | hostname | port | status | lb_weight | role    | ... | replication_delay | last_status_change
    ---------+----------+------+--------+-----------+---------+-----+-------------------+---------------------
     0       | Host1    | 5432 |   up   | 0.333333  | master  | ... |        0          | 2019-10-14 06:05:42
     1       | Host2    | 5432 |   up   | 0.333333  | standby | ... |        0          | 2019-10-14 06:05:42
     2       | Host3    | 5432 |   up   | 0.333333  | standby | ... |        0          | 2019-10-14 06:05:42
    (3 rows)
  2. Using the timestamps identified in the previous step, check the network daemon logs on each vIDM node for corresponding network events like a lost or gained carrier.
    $ journalctl -u systemd-networkd
    -- Logs begin at Fri 2022-03-04 05:35:39 UTC, end at Wed 2022-04-20 09:13:31 UTC. --
    ...
    Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Lost carrier
    Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Gained carrier
    Apr 07 17:06:29 vidmnode1.lab.com systemd-networkd[1374]: eth0: Configured
  3. Search for errors in the primary node's /var/log/pgService/pgService.log around the same time. Look for fatal errors indicating a loss of network that would cause the watchdog service to shut down.
    2022-05-27 05:57:32: pid 2488: WARNING: network IP is removed and system has no IP is assigned 
    2022-05-27 05:57:32: pid 2488: DETAIL: changing the state to in network trouble 
    2022-05-27 05:57:32: pid 2488: LOG: watchdog node state changed from [MASTER] to [IN NETWORK TROUBLE] 
    2022-05-27 05:57:32: pid 2488: FATAL: system has lost the network 
    2022-05-27 05:57:32: pid 2488: LOG: Watchdog is shutting down 
    2022-05-27 05:57:32: pid 14653: LOG: watchdog: de-escalation started
  4. Validate the timestamps of each node's last reboot in the output of journalctl --list-boots. In a healthy cluster, the nodes should have similar reboot times if a graceful restart of the environment was performed. A sketch that combines these checks is shown after this list.
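
To speed up this correlation, the searches from steps 1, 2, and 4 can be combined. The following is a minimal sketch; the timestamp window is taken from the example above and the log paths are those already listed, so adjust both for your environment.

    # On the Suite Lifecycle appliance (or in an extracted log bundle):
    # list every recorded cluster status block that mentions a replication delay
    grep -A 5 -r "replication_delay" /var/log/vrlcm/vmware_vrlcm*.log

    # Note the last_status_change timestamps, then on each vIDM node check for
    # carrier events around that time (example window around 2022-04-07 17:06)
    journalctl -u systemd-networkd --since "2022-04-07 16:50" --until "2022-04-07 17:20"

    # Confirm whether the node also rebooted in the same window
    journalctl --list-boots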

VMware Aria Suite Lifecycle Notifications

Suite Lifecycle's health notifications changed in version 8.8.2.

  • Suite Lifecycle 8.8.2 and later
    • RED: Postgres is down or there are high, unrecoverable replication delays.
    • YELLOW: There are minor, auto-recoverable replication delays.
    • GREEN: The cluster is healthy, all nodes are up, and there is no replication delay.
  • Versions prior to 8.8.2
    • RED: The cluster is unhealthy. This could mean a node is down, Postgres is down, or there is any level of replication delay.
    • GREEN: The cluster is healthy.

Additional Information

Understanding Scheduled Health Checks

VMware Aria Suite Lifecycle continuously monitors the health of VMware Identity Manager (vIDM) by performing a series of scheduled checks. This section outlines the key steps in the process and shows log examples of both successful and failed checks.

The health check process includes the following steps:

  1. First, it validates SSH connectivity to each node by running a simple ls command as the root user.
    2022-10-26 21:14:31.665 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
    2022-10-26 21:14:31.667 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: ls
    2022-10-26 21:14:31.713 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> ls
  2. It then identifies which node holds the delegated IP address for the cluster.
    2022-10-26 21:14:34.961 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
    2022-10-26 21:14:34.961 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2
    ...
    2022-10-26 21:15:05.059 INFO  [...] c.v.v.l.v.c.u.VidmPgpoolUtil -  -- delegateIP is assigned to #.#.#.102
  3. The status of the pgService is checked on each node.
    2022-10-26 21:15:05.059 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: /etc/init.d/pgService status
    2022-10-26 21:15:05.110 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> /etc/init.d/pgService status
    ...
    2022-10-26 21:15:06.158 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- Getting pgpool service status pgpool service is running
  4. The pgpool password is retrieved to perform subsequent database checks.
    2022-10-26 21:15:08.359 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: cat /usr/local/etc/pgpool.pwd
    2022-10-26 21:15:08.408 INFO  [...] c.v.v.l.u.SshUtils -  -- Executing command --> cat /usr/local/etc/pgpool.pwd
  5. Next, it determines which node is the current MASTER in the watchdog configuration.
    2022-10-26 21:15:38.464 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- vIDM ENDPOINT HOST :: #.#.#.102
    2022-10-26 21:15:38.464 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND :: su root -c "echo -e MXMXMXMX|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"
    ...
    vIDMNode02.lab.local:9999 Linux vIDMNode02.lab.local vIDMNode02.lab.local 9999 9000 7 STANDBY
    vIDMNode01.lab.local:9999 Linux vIDMNode01.lab.local #.#.#.101 9999 9000 4 MASTER
    vIDMNode03.lab.local:9999 Linux vIDMNode03.lab.local #.#.#.103 9999 9000 7 STANDBY
  6. It then connects to the database to check the status of all pool nodes, verifying they are up, identifying the primary node, and checking for replication delay.
    2022-10-26 21:16:09.662 INFO  [...] c.v.v.l.v.d.h.VidmUtil -  -- COMMAND ::  su postgres -c "echo -e MXMXMXMX|/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""
    ...
     node_id |  hostname  | port | status | lb_weight |  role   | ... | replication_delay 
    ---------+------------+------+--------+-----------+---------+-----+-------------------
     0       | #.#.#.101 | 5432 | up     | 0.333333  | standby | ... | 0                 
     1       | #.#.#.102 | 5432 | up     | 0.333333  | primary | ... | 0                 
     2       | #.#.#.103 | 5432 | up     | 0.333333  | standby | ... | 0                 
    (3 rows)
  7. Finally, based on the previous checks, the overall cluster health status is determined.
    2022-10-26 21:17:44.418 INFO  [...] VidmClusterHealthNotificationTask -  -- vIDM Cluster health status is GREEN !!!
    2022-10-26 21:17:44.418 INFO  [...] VidmClusterHealthNotificationTask -  -- Skipping update of existing cluster health notification status level

Aria Suite Lifecycle periodically runs this task to auto-recover issues such as replication delays caused by network glitches. The same checks can also be run manually, as shown in the sketch below.
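
The following is a minimal sketch of how the same checks can be reproduced on a vIDM node, based on the commands visible in the log excerpts above; pcp_watchdog_info and psql prompt for the pgpool password, which Suite Lifecycle reads from /usr/local/etc/pgpool.pwd.

    # Which node currently holds the delegated (floating) IP address
    ifconfig eth0:0 | grep 'inet addr:' | cut -d: -f2

    # Is the pgService (pgpool) running on this node?
    /etc/init.d/pgService status

    # Watchdog view of the cluster: which node is MASTER and which are STANDBY
    /usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool

    # Pool node status: up/down, primary/standby role, and replication delay
    su postgres -c "/opt/vmware/vpostgres/current/bin/psql -h localhost -p 9999 -U pgpool postgres -c \"show pool_nodes\""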

Example Failure Scenario: SSH Authentication Failure

If Aria Suite Lifecycle is unable to SSH to the VMware Identity Manager nodes, the health check will fail and report the cluster as unhealthy.

Here is an example of the log output when an SSH connection fails (a manual verification sketch follows the excerpt):

2023-03-05 03:59:53.869 ERROR [...] c.v.v.l.u.SessionHolder -  -- SessionHolder.newSession Exception encountered
com.jcraft.jsch.JSchException: Auth fail
        at com.jcraft.jsch.Session.connect(Session.java:519)
        at com.vmware.vrealize.lcm.util.SessionHolder.newSession(SessionHolder.java:53)
        at com.vmware.vrealize.lcm.util.SshUtils.executeWithKnownTimeout(SshUtils.java:620)
        at com.vmware.vrealize.lcm.vidm.driver.helpers.VidmUtil.runVidmSshCommand(VidmUtil.java:185)
        [...]
2023-03-05 03:59:53.923 ERROR [...] c.v.v.l.u.SshUtils -  -- Exception cause : com.jcraft.jsch.JSchException: Auth fail
2023-03-05 03:59:53.929 ERROR [...] c.v.v.l.v.c.u.VidmPgpoolUtil -  -- Exception while validating SSH root credentials of the vIDM host - #.#.#.102
com.vmware.vrealize.lcm.util.exception.SshAuthenticationFailureException: Cannot execute ssh commands on the host - #.#.#.102, validate the SSH login credentials.
        at com.vmware.vrealize.lcm.vidm.driver.helpers.VidmUtil.runVidmSshCommand(VidmUtil.java:193)
        at com.vmware.vrealize.lcm.vidm.clustering.util.VidmPgpoolUtil.checkClusterStatus(VidmPgpoolUtil.java:829)
        [...]

[... The above authentication failure repeats for hosts #.#.#.103 and #.#.#.101 ...]

2023-03-05 04:00:05.875 INFO  [...] VidmClusterHealthNotificationTask -  -- Final Status is not empty : ---->
Unable to create SSH connection to the VMware Identity Manager host(s): [#.#.#.102, #.#.#.103, #.#.#.101] as root user from vRSLCM. Ensure vRSLCM inventory has the right root passwords for Identity Manager nodes and they are not expired. Trigger inventory sync of globalenvironment in vRSLCM to update the current root password.
2023-03-05 04:00:05.876 INFO  [...] VidmClusterHealthNotificationTask -  -- vIDM Cluster health status is RED !!!
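
To confirm this root cause, the SSH connectivity used by the health check can be verified by hand. The following is a minimal sketch with illustrative node names; the ssh test mirrors the ls connectivity check performed by Suite Lifecycle, while chage is a standard Linux command added here to check for an expired root password.

    # Replace with the IP addresses or FQDNs of the vIDM nodes
    NODES="node1.example.com node2.example.com node3.example.com"

    # From the vRSLCM appliance, verify root login on each vIDM node
    for node in $NODES; do
        ssh root@"$node" "ls" > /dev/null && echo "SSH to $node OK" || echo "SSH to $node FAILED"
    done

    # On a vIDM node, check whether the root password has expired
    chage -l root

If the passwords are out of date, update them in the Locker and trigger an inventory sync of globalenvironment in vRSLCM, as indicated in the notification message above.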

Attachments

Scheduled vIDM Health check - full output