VMware Identity Manager UI fails to load with 502 error and JDBC Connection Exception

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

In pre-CSP-102092 vIDM environments (specifically 3.3.x), you may encounter the following:

502 Bad Gateway is displayed when accessing https://<vIDMFQDN>/SAAS.
Postgres Cluster appears healthy (1 Primary, 2 Replicas) when running show pool_nodes, yet the application cannot connect.
Log Error (/opt/vmware/horizon/workspace/logs/horizon.log): Could not open JPA EntityManager for transaction; nested exception is org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC Connection
VIP Instability: The delegateIP is automatically removed from the primary node within 30 seconds of being manually assigned.
Resolution Error: /etc/hosts contains the correct delegateIP mapping, but attempts to ping the delegateIP from any node fail.
Integrity Failure: Running the health check script or an md5sum check on /etc/init.d/NetworkService or /usr/local/etc/auto-recovery.sh reveals inconsistent values across the three nodes.
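To confirm the log symptom quickly, the error string can be counted in horizon.log. The following is a minimal sketch; the function name count_jdbc_errors is hypothetical, and the default path is the one named above.

```shell
#!/bin/sh
# Count lines containing the JDBC connection failure in a horizon.log file.
# count_jdbc_errors is a hypothetical helper name, not part of vIDM.
count_jdbc_errors() {
    # Default to the log path named in this article; pass another path to override
    log_file="${1:-/opt/vmware/horizon/workspace/logs/horizon.log}"
    # -c prints the number of matching lines, -F treats the pattern as a fixed string
    grep -cF "Unable to acquire JDBC Connection" "$log_file"
}

# Usage on a vIDM node:
#   count_jdbc_errors
```

A non-zero count alongside the 502 response is consistent with the symptoms listed above, but does not by itself confirm the md5sum mismatch.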

Environment

VMware Identity Manager 3.3.7

Cause

This issue is caused by a "Split-Brain" safety mechanism at the management layer. The NetworkService executes a health-check script (auto-recovery.sh) every 30 seconds. One of its critical checks is to ensure that management binaries and scripts are identical across all nodes.

If an md5sum mismatch is detected, the service concludes the cluster state is inconsistent and executes an ifconfig eth0:0 down on the primary node. This is a defensive action to prevent data corruption, but it results in a 502 error because the application loses its path to the database.
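The decision described above can be illustrated with a short sketch. This is not the contents of auto-recovery.sh; the function name decide_vip_action and its output strings are hypothetical, and the sketch only prints the action rather than touching any interface.

```shell
#!/bin/sh
# Hypothetical illustration only; this is NOT the real auto-recovery.sh logic.
# Takes one md5 hash per node as arguments and prints the resulting action.
decide_vip_action() {
    first="$1"
    for hash in "$@"; do
        # Any hash that differs from the first node's hash counts as a mismatch
        if [ "$hash" != "$first" ]; then
            echo "mismatch: take eth0:0 down"
            return 1
        fi
    done
    echo "consistent: keep eth0:0"
    return 0
}

# Usage with three placeholder hashes, one per node:
#   decide_vip_action "$hash_node1" "$hash_node2" "$hash_node3"
```

Under this model, a single differing hash on any one node is enough to remove the delegateIP from the primary, which matches the behavior described in this article.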

Resolution

Triage & Rebuild via CSP Patch

Phase 1: Prevent delegateIP stripping

To restore service immediately, you must disable the automated health checks that are stripping the IP. Perform the first two steps (stopping the NetworkService and verifying that it is stopped) on all three nodes.

Stop and Disable the NetworkService:

/etc/init.d/NetworkService stop
# Ensure it does not restart automatically during triage
touch /usr/local/etc/LCM_DISABLE_AUTO_RECOVERY

Verify Service is Stopped:
```
/etc/init.d/NetworkService status 
```
Note: The service must remain stopped until the CSP patch is applied later.
Manually Assign the delegateIP (Primary Node Only): Identify the delegateIP from /etc/hosts and apply it to the primary node (as identified by pcp_watchdog_info):
```
# Replace <delegateIP> and <netmask> with your environment's values
ifconfig eth0:0 <delegateIP> netmask <netmask> up
```
Persistence Check: Verify the IP "sticks" by running this loop for at least 60 seconds, then press Ctrl+C to stop it. If the IP is still reported on every iteration, the NetworkService is successfully suppressed.
```
while true; do ifconfig eth0:0 | grep "inet "; sleep 5; done
```
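If a loop that stops on its own is preferred, the same check can be bounded. The following is a sketch under assumptions: check_persists is a hypothetical helper that reruns any command at a fixed interval and reports the first failure.

```shell
#!/bin/sh
# Re-run a check command every <interval> seconds for <duration> seconds.
# Returns 0 if the check passed every time, 1 on the first failure.
# check_persists is a hypothetical helper name, not part of vIDM.
check_persists() {
    duration="$1"; interval="$2"; shift 2
    elapsed=0
    while [ "$elapsed" -lt "$duration" ]; do
        if ! "$@"; then
            echo "check failed after ${elapsed}s"
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "check passed for ${duration}s"
}

# Usage on the primary node (60 seconds, checking every 5 seconds):
#   check_persists 60 5 sh -c 'ifconfig eth0:0 | grep -q "inet "'
```

A failure before the 60 seconds elapse suggests that something is still removing the delegateIP, so recheck that the NetworkService is stopped on every node.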

Phase 2: Root Cause Verification

Confirm the md5sum mismatch that necessitated the patch. Run the health check from https://knowledge.broadcom.com/external/article/410295/how-do-i-generate-a-vidm-health-check-re.html once. This script is cluster aware and will check all necessary md5sums.

Result: If any node has a different hash than the others, the cluster logic is broken and requires the CSP patch to realign the binaries.
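For a manual spot check of the two files named in the Issue section, the hashes can be printed on each node and compared by eye. This is a sketch that supplements the health check script and does not replace it; print_hashes is a hypothetical helper name.

```shell
#!/bin/sh
# Print md5 hashes for the given files.
# print_hashes is a hypothetical helper name, not part of vIDM.
print_hashes() {
    # With no arguments, default to the two files named in this article
    if [ "$#" -eq 0 ]; then
        set -- /etc/init.d/NetworkService /usr/local/etc/auto-recovery.sh
    fi
    md5sum "$@"
}

# Usage: run on each of the three nodes and compare the output line by line
#   print_hashes
```

This covers only those two files, whereas the health check script is described above as checking all necessary md5sums.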

Phase 3: Permanent Rebuild (CSP-102092 Patch)

Apply the patch from https://knowledge.broadcom.com/external/article/412021, following all instructions. Once the procedure is complete, including Step 2.10, the postgres binaries should be correct and clustering restored. Rerun the health check script from https://knowledge.broadcom.com/external/article/410295/how-do-i-generate-a-vidm-health-check-re.html to validate that all health checks are green.