Alerts in VMware Aria Suite Lifecycle about vIDM cluster health with Secondary VMware Identity Manager nodes going 'down' occasionally


Article ID: 322680


Products

VMware Aria Suite

Issue/Introduction

Symptoms:
  • This article applies to VMware Aria Suite Lifecycle 8.14 or later with 'Cluster-Auto-Recovery' enabled for Workspace ONE Access (VMware Identity Manager, vIDM, 3.3.x).


Environment

VMware Identity Manager 3.3.x

Cause

Replication delays can occur due to underlying network latency. The VMware Identity Manager cluster auto-recovery service is configured with a default replication delay threshold of 1000 bytes. When the delay exceeds this threshold, auto-recovery is triggered and the secondary nodes are taken 'down' so that they can resynchronize their data with the primary.

VMware Aria Suite Lifecycle runs an hourly health check on the VMware Identity Manager appliances. If a replication delay or a node 'down' event coincides with one of these checks, VMware Aria Suite Lifecycle issues a health status notification.

Resolution

To resolve the issue, correct the underlying network problem that is causing the replication delays between Identity Manager nodes.

Workaround:

To work around the issue, increase the auto_recovery_replication_delay_threshold setting:

1. Find the typical replication delays on this setup:

    cd /var/log/pgService
    # List the 50 largest distinct delay values, sorted numerically in descending order
    zgrep "SECONDARY_1_REPLICATION_DELAY" * | cut -d '=' -f2 | sort -rn | uniq | head -n50
    # Similarly for SECONDARY_2
    zgrep "SECONDARY_2_REPLICATION_DELAY" * | cut -d '=' -f2 | sort -rn | uniq | head -n50



2. Based on the highest delays reported above, configure the new threshold in /usr/local/etc/lcm-pgpool.conf, for example:

    auto_recovery_replication_delay_threshold=10000



3. Perform this change on all the Identity Manager nodes in the cluster.
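
The steps above could be scripted roughly as follows. This is a minimal sketch, not a VMware-provided tool: the function names are hypothetical, the log lines are assumed to end in KEY=<integer> (as the cut step implies), the config entry is assumed to be written as key=value with no surrounding spaces, and the 2x headroom factor is purely illustrative. It also assumes GNU sed for in-place editing.

```shell
#!/bin/sh
# Hypothetical helper: print the largest replication delay found on stdin.
# Assumes each matching line ends in SECONDARY_<n>_REPLICATION_DELAY=<integer>.
max_replication_delay() {
    grep -E 'SECONDARY_[0-9]+_REPLICATION_DELAY=' \
        | awk -F'=' '{ v = $NF + 0; if (v > max) max = v } END { print max + 0 }'
}

# Hypothetical helper: suggest a threshold with headroom above the observed maximum.
# The 2x multiplier is an illustrative assumption, not a documented recommendation.
suggest_threshold() {
    echo $(( $1 * 2 ))
}

# Hypothetical helper: set the threshold in a config file, appending the key if absent.
# $1 = config file path, $2 = new threshold value. Requires GNU sed (-i).
set_threshold() {
    key=auto_recovery_replication_delay_threshold
    cp -p "$1" "$1.bak"    # keep a backup copy before editing
    if grep -q "^${key}=" "$1"; then
        sed -i "s/^${key}=.*/${key}=$2/" "$1"
    else
        echo "${key}=$2" >> "$1"
    fi
}

# Example use on one Identity Manager node (paths taken from the steps above):
#   max=$(zgrep -h "REPLICATION_DELAY" /var/log/pgService/* | max_replication_delay)
#   set_threshold /usr/local/etc/lcm-pgpool.conf "$(suggest_threshold "$max")"
```

Review the suggested value before applying it, and repeat the change on every node so that the threshold is the same across the cluster.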


Additional Information

Impact/Risks:
Replication delays occur due to a slow underlying network or other unforeseen network characteristics. To reduce notification fatigue, you can configure 'Cluster-Auto-Recovery' to use a higher threshold for replication delays.