Emergency Cluster Bypass for VMware Identity Manager due to Out-of-Order Patching


Article ID: 428875


Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

This article provides steps to force a 3-node VMware Identity Manager (vIDM) cluster into a "Logical Standalone" state to restore service availability.

Important: Perform this procedure only if you have confirmed that out-of-order patching has occurred in Aria Suite Lifecycle 8.18.0 (LCM) where the LCM patch was installed first and the CSP patch was skipped.

Symptoms include:

  • Critical service outage where the cluster management logic (Pgpool-II/Watchdog) fails to elect a master or bind the Virtual IP (VIP).
  • Cluster is in a "Split-Brain" or inconsistent state following a failed or improper patch cycle.

Warning: When this configuration is active, High Availability (HA) is DISABLED. If the Master node fails, the service will stop immediately because Pgpool is effectively removed from the decision loop.

Environment

  • VMware Identity Manager 3.3.x (Cluster Deployments)

  • VMware Aria Suite Lifecycle (formerly vRealize Suite Lifecycle Manager)

Cause

The issue is caused by an out-of-order patching workflow in LCM where the LCM patch was applied to the vIDM cluster before the required CSP patch. This violation of the patching order corrupts the cluster automation logic, preventing standard VIP delegation.

Resolution

To bypass the corrupted cluster logic and restore service manually, follow the phases below.

Phase 1: Stop auto-recovery automation

You must prevent the auto-recovery.sh script and the Pgpool Watchdog from overwriting manual changes or rebooting nodes.

  1. Log in to ALL nodes in the cluster via SSH.

  2. Stop the Network Service:

    service NetworkService stop
    
  3. Verify the service is stopped:

    /etc/init.d/NetworkService status
    

    Expected Output: Checking for service NetworkService: ..not running
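Because the service must be stopped on every node, the commands above can be wrapped in a small loop. This is a minimal sketch, not part of the official procedure: the hostnames are hypothetical, root SSH access is assumed, and the SSH variable can be overridden (for example, SSH=echo) to dry-run the loop.

```shell
# Sketch: stop NetworkService on each cluster node over SSH.
# Hostnames are hypothetical; override SSH (e.g. SSH=echo) for a dry run.
stop_network_service() {
  for node in "$@"; do
    ${SSH:-ssh} root@"$node" "service NetworkService stop; /etc/init.d/NetworkService status"
  done
}

# Example (hypothetical node names):
# stop_network_service vidm-node1.example.com vidm-node2.example.com vidm-node3.example.com
```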

Phase 2: Identify the Write-Master

You must confirm which node holds the Read/Write copy of the database to ensure data integrity.

  1. Run the following command on EACH node until you identify the Master:

    /opt/vmware/vpostgres/current/bin/psql -U postgres -d postgres -h localhost -c "SELECT pg_is_in_recovery();"
    
  2. Interpret the results:

    • t (True): Standby (Read-Only). Do not use this node.

    • f (False): Master (Read/Write). Proceed using this node.
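The interpretation step can be captured in a small helper that labels a node from the single-character result of pg_is_in_recovery(). A minimal sketch; the function name is made up for illustration, and the psql flags -tAc (tuples only, unaligned, run command) are used to get a bare t or f:

```shell
# Classify a node from the result of "SELECT pg_is_in_recovery();"
# (t = standby, f = master).
classify_node() {
  case "$1" in
    f) echo "Master (Read/Write) - proceed with this node" ;;
    t) echo "Standby (Read-Only) - do not use this node" ;;
    *) echo "Unexpected result: $1" ;;
  esac
}

# Example:
# RESULT=$(/opt/vmware/vpostgres/current/bin/psql -U postgres -d postgres -h localhost -tAc "SELECT pg_is_in_recovery();")
# classify_node "$RESULT"
```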

Phase 3: Seize the Virtual IP (VIP)

Manually bind the floating VIP (delegateIP) to the Master node so that applications can connect.

  1. Log in to the Master Node identified in Phase 2.

  2. Verify your network mask (usually /24 or /21) by running ifconfig eth0.

  3. Bind the interface using the following syntax:

    # Example (replace delegateIP with the cluster's floating VIP and adjust the netmask to match eth0):
    ifconfig eth0:0 inet delegateIP netmask 255.255.255.0 up
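If you prefer to derive the dotted netmask from the CIDR prefix shown by ifconfig eth0, the two prefixes this article mentions map as follows. This is a hypothetical helper covering only those cases, not part of the official procedure:

```shell
# Map a CIDR prefix length to its dotted netmask for the ifconfig command.
# Only the prefixes mentioned in this article (/24 and /21) are handled.
prefix_to_netmask() {
  case "$1" in
    24) echo "255.255.255.0" ;;
    21) echo "255.255.248.0" ;;
    *)  echo "unhandled prefix /$1" >&2; return 1 ;;
  esac
}

# Example (delegateIP is a placeholder for the cluster's floating VIP):
# ifconfig eth0:0 inet delegateIP netmask "$(prefix_to_netmask 24)" up
```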

Phase 4: Application Recovery

Clear stale connection pools and force vIDM to connect to the "new" local VIP.

  1. Perform the following step on ALL nodes.

  2. Restart the Workspace service:

    service horizon-workspace restart
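As in Phase 1, the restart can be looped over every node. A minimal sketch with hypothetical hostnames; the SSH variable can be overridden (for example, SSH=echo) to dry-run the loop:

```shell
# Sketch: restart the Workspace service on each cluster node over SSH.
# Hostnames are hypothetical; override SSH (e.g. SSH=echo) for a dry run.
restart_workspace() {
  for node in "$@"; do
    ${SSH:-ssh} root@"$node" "service horizon-workspace restart"
  done
}

# Example (hypothetical node names):
# restart_workspace vidm-node1.example.com vidm-node2.example.com vidm-node3.example.com
```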
    

You may now follow the instructions in CSP-102547 Patch Instructions for VMware Identity Manager 3.3.7 and VMware Aria Suite Lifecycle 8.18.0 Patch 6 to patch the system back into a properly clustered configuration.