Quorum is lost and CAPAM production is down

Article ID: 188672

Products

CA Privileged Access Manager (PAM)
CA Privileged Access Manager - Cloakware Password Authority (PA)
CA Privileged Access Manager - Server Control (PAMSC)

Issue/Introduction

My cluster lost two of its three Primary nodes, which resulted in quorum loss.
Two of the Primary nodes are stuck "rebooting", and the Access GUI cannot be reached.

Environment

Release : 3.3.1

Component : PRIVILEGED ACCESS MANAGEMENT

Cause

Environmental issues such as limited communication throughput, combined with MySQL issues in versions earlier than 3.3.2, can cause the loss of one or two Primary nodes and, with it, the loss of quorum. When one Primary node is lost, quorum is lost, but the Primary cluster usually self-heals in about 30 minutes, once it finishes syncing. When two Primary nodes are lost, the cluster becomes confused about its own state and may not recover on its own. This can be due to a combination of communication issues and MySQL issues that have been fixed in 3.3.2.

In this case, the databases of Node1 and Node2 were disabled, while Node3 and all the remote sites showed Green and In Sync.

Clustering was turned off from Node3, but Node1 and Node2 did not reflect this in their status.

All Primary nodes (Node1, Node2 and Node3) were rebooted in the hope that this would clear their state and allow clustering to be restarted, but it did not.

 

Resolution

The consoles of Node1 and Node2 showed that they were trying to connect to a running cluster and were unable to do so.
The two nodes retried every 30 seconds. After almost 1 hour, the web UI returned on both Node1 and Node2.
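Because the nodes can take close to an hour to bring the web UI back, it may be convenient to poll for it instead of checking by hand. The following is a minimal sketch only, not a Broadcom tool: it uses the Python standard library to request each node's URL at the same 30-second interval for up to one hour. The function names, the URLs passed in, and the decision to skip certificate verification (on the assumption that the appliances may present self-signed certificates) are all illustrative assumptions.

```python
import ssl
import time
import urllib.error
import urllib.request


def ui_is_up(url, timeout=10):
    """Return True if the web server at url answers with a non-5xx response."""
    # Assumption: the appliance may use a self-signed certificate,
    # so certificate verification is disabled for this reachability check.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx):
            return True
    except urllib.error.HTTPError as err:
        # The server answered; treat server-side errors as "not ready yet".
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def wait_for_nodes(nodes, check=ui_is_up, interval=30, max_wait=3600,
                   sleep=time.sleep):
    """Poll every node until its UI answers or max_wait seconds of waiting pass.

    nodes maps a label (for example "Node1") to a URL.
    Returns the sorted labels of the nodes that are still unreachable.
    """
    pending = dict(nodes)
    waited = 0
    while pending:
        for name in list(pending):
            if check(pending[name]):
                del pending[name]
        if not pending or waited >= max_wait:
            break
        sleep(interval)
        waited += interval
    return sorted(pending)
```

A call such as `wait_for_nodes({"Node1": "https://node1.example.com/", "Node2": "https://node2.example.com/"})` (hypothetical host names) returns an empty list once both UIs respond. A reachable UI only shows that the web server is answering; it does not confirm that the cluster is back in sync.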

To get the cluster back in Sync:

Log on to Node1 and leave the cluster. From Node3, select “Save config to cluster”.

Log on to Node2 and leave the cluster. From Node3, select “Save config to cluster”.

With all configurations saved to the cluster and all appliances aware that the cluster is stopped, start the cluster from Node1 (or from the appliance at the top of the Primary site list).

Clustering should restart fairly quickly.

Additional Information

If remote sites normally take a long time to sync, you can delete the Session Logs on the remote nodes to help them sync with the Primary site much faster.