In Aria Lifecycle Manager (vRSLCM), VMware Identity Manager (vIDM) is flagged as critical under the health status check.
Upon investigation, one of the vIDM nodes is found to be in a SHUTDOWN state, disrupting the service.
Checking the pgpool watchdog status on the primary vIDM node confirms this:
su root -c "echo -e 'password'|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"
Output:
<Host1>:9999 Linux <Host1> <Host1> 9999 9000 4 MASTER
<Host2>:9999 Linux <Host2> <Host2> 9999 9000 7 SHUTDOWN
<Host3>:9999 Linux <Host3> <Host3> 9999 9000 7 STANDBY
Reviewing /var/log/pgService/pgService.log shows that the shutdown was caused by a network issue:
2025-03-18T14:03:38.680992+00:00 pgpool[21165]: [83872-1] 2025-03-18 14:03:38: pid 21165: WARNING: network IP is removed and system has no IP is assigned
2025-03-18T14:03:38.681263+00:00 pgpool[21165]: [83872-2] 2025-03-18 14:03:38: pid 21165: DETAIL: changing the state to in network trouble
2025-03-18T14:03:38.681593+00:00 pgpool[21165]: [83873-1] 2025-03-18 14:03:38: pid 21165: LOG: watchdog node state changed from [STANDBY] to [IN NETWORK TROUBLE]
2025-03-18T14:03:38.681624+00:00 pgpool[21165]: [83874-1] 2025-03-18 14:03:38: pid 21165: FATAL: system has lost the network
2025-03-18T14:03:38.681656+00:00 pgpool[21165]: [83875-1] 2025-03-18 14:03:38: pid 21165: LOG: Watchdog is shutting down
VMware Identity Manager 3.3.x
The affected vIDM node loses its network connection, which leads to its removal from the cluster and a transition to the SHUTDOWN state.
Step 1: Retrieve the pgpool Password
Open an SSH session to all three vIDM appliances and run the following command to get the password in use:
cat /usr/local/etc/pgpool.pwd
NOTE: If no value is returned, use the default password 'password' in the steps below.
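As an illustration, the retrieved value can be stored in a shell variable with a fallback to the default. This is a minimal sketch, and the PGPOOL_PWD variable name is an example, not something defined by the appliance:
# Capture the pgpool password, falling back to the default if the file is empty or missing (illustrative only)
PGPOOL_PWD=$(cat /usr/local/etc/pgpool.pwd 2>/dev/null)
PGPOOL_PWD=${PGPOOL_PWD:-password}
echo "Using pgpool password: ${PGPOOL_PWD}"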
Step 2: Verify Cluster Status
To determine the current state of the cluster, run the following command on the primary vIDM node:
su root -c "echo -e 'password'|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"
Output:
<Host1>:9999 Linux <Host1> <Host1> 9999 9000 4 MASTER
<Host2>:9999 Linux <Host2> <Host2> 9999 9000 7 SHUTDOWN
<Host3>:9999 Linux <Host3> <Host3> 9999 9000 7 STANDBY
This confirms that Host2 is in SHUTDOWN mode while the other nodes remain functional.
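If Step 1 returned a non-default password, substitute it for 'password' in the command above. For example, reusing the illustrative PGPOOL_PWD variable from the Step 1 sketch (an assumption for illustration, not something the appliance defines):
su root -c "echo -e '${PGPOOL_PWD}'|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"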
Step 3: Restart the Affected Node’s Database Service
To bring the affected node back online, restart its database service (pgService) using:
/etc/init.d/pgService restart
After the restart, the node rejoins the cluster. Re-run the watchdog check from Step 2 to verify its status, as shown below.
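For example, running the same verification command again should now show the previously affected node in a healthy state rather than SHUTDOWN (this example assumes the default password):
su root -c "echo -e 'password'|/usr/local/bin/pcp_watchdog_info -p 9898 -h localhost -U pgpool"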
Step 4: Enable Auto-Recovery
To prevent future occurrences, enable Auto-Recovery to allow automatic recovery of the cluster in case of similar network disruptions.
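For context, the manual recovery in Step 3 can also be expressed as a small watch script. The sketch below is purely illustrative: it is not the product's Auto-Recovery feature, it assumes the local pgpool process is named pgpool, and it only combines commands already shown above:
#!/bin/bash
# Illustrative only -- not the vIDM/vRSLCM Auto-Recovery feature.
# If the local pgpool process is no longer running, restart the database service,
# mirroring the manual recovery performed in Step 3.
if ! pgrep -x pgpool > /dev/null; then
    /etc/init.d/pgService restart
fi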