Alerts in VMware Aria Suite Lifecycle about vIDM cluster health with Secondary VMware Identity Manager nodes going 'down' occasionally

Article ID: 322680

Products

VMware Aria Suite

Issue/Introduction

  • This article applies to VMware Aria Suite Lifecycle 8.14 or later with 'Cluster-Auto-Recovery' enabled for Workspace ONE Access (VMware Identity Manager / vIDM 3.3.x).

  • VMware Aria Suite Lifecycle 8.14 Release Notes
  • vIDM Cluster Auto-Recovery blog.
  • Secondary nodes show replication delays greater than 1000 bytes relative to the primary node and intermittently go 'down'.
  • Alerts are visible in VMware Aria Suite Lifecycle about vIDM Postgres Cluster health.
  • "The VMware Identity manager service is experiencing some issues. End user can still log in and launch apps." banner on vIDM UI.
  • The vIDM dashboard otherwise shows a 'GREEN' status, with all components and services operational.

 

Environment

VMware Identity Manager 3.3.x

Cause

Replication delays can occur due to underlying network latency. The VMware Identity Manager cluster auto-recovery service is configured with a default replication delay threshold of 1000 bytes. When a secondary node's delay exceeds this threshold, auto-recovery is triggered and the node is taken 'down' to resynchronize its data with the primary.

VMware Aria Suite Lifecycle performs an hourly health check on the VMware Identity Manager appliances. If replication delays or node 'down' events coincide with one of these checks, VMware Aria Suite Lifecycle issues a health status notification.
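To see how a cluster is currently configured, you can inspect the setting on a vIDM node. This is a minimal check, assuming the configuration file location used in the Resolution below; if the line is absent, the 1000-byte default described above applies:

    # Show the currently configured replication delay threshold (in bytes)
    grep "auto_recovery_replication_delay_threshold" /usr/local/etc/lcm-pgpool.conf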

Resolution

To resolve the issue, correct the underlying network problem that is causing the replication delays between the Identity Manager nodes.

Workaround:

To work around the issue, increase the auto_recovery_replication_delay_threshold setting:

1. Find the typical replication delays on this setup:

    cd /var/log/pgService
    # List the reported delay values (in bytes) for the first secondary, highest first
    zgrep "SECONDARY_1_REPLICATION_DELAY" * | cut -d '=' -f2 | sort -rn | uniq | head -n 50
    # Similarly for SECONDARY_2
    zgrep "SECONDARY_2_REPLICATION_DELAY" * | cut -d '=' -f2 | sort -rn | uniq | head -n 50



2. Based on the highest delays reported above, configure a new threshold in /usr/local/etc/lcm-pgpool.conf.

For example:

    auto_recovery_replication_delay_threshold=100000

Note: 100000 is only a reference value; we recommend setting the actual threshold based on the maximum values seen in the output of the commands in step 1.
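If you prefer to make the change from the command line, the value can be updated in place. This is a sketch that assumes the auto_recovery_replication_delay_threshold line already exists in the file; back up the file first and replace 100000 with the value chosen in step 1:

    # Back up the configuration, then update the threshold in place
    cp /usr/local/etc/lcm-pgpool.conf /usr/local/etc/lcm-pgpool.conf.bak
    sed -i 's/^auto_recovery_replication_delay_threshold=.*/auto_recovery_replication_delay_threshold=100000/' /usr/local/etc/lcm-pgpool.conf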

3. Perform this change on all the Identity Manager nodes in the cluster.
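To apply the edit consistently across the cluster, the per-node change can be scripted. This is a sketch only; the hostnames vidm-node1 through vidm-node3 are hypothetical placeholders for your actual Identity Manager appliances, and it assumes root SSH access to each node is available:

    # Hypothetical node names; replace with the FQDNs of your vIDM appliances
    for node in vidm-node1 vidm-node2 vidm-node3; do
        ssh root@"$node" \
          "sed -i 's/^auto_recovery_replication_delay_threshold=.*/auto_recovery_replication_delay_threshold=100000/' /usr/local/etc/lcm-pgpool.conf"
    done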

Additional Information

Impact/Risks:
Replication delays occur due to a slow underlying network or other unforeseen network characteristics. To alleviate notification fatigue, the 'Cluster-Auto-Recovery' service can be configured with a higher replication delay threshold.

What is the maximum supported value for the replication delay threshold?
- There is no fixed upper limit.