For upgrades or hotfixes that require the cluster to be stopped, the following procedure can minimize downtime:
Phase 1: Have one node from each secondary site leave the cluster, one at a time, and patch them. -> No downtime.
Phase 2: Stop the cluster, patch two of the three primary site nodes, then cluster them with the nodes updated in Phase 1. -> Shorter downtime, because only two of the cluster nodes need to be patched while the cluster is stopped.
Phase 3: Patch the remaining nodes and have them rejoin the cluster one by one. -> No downtime.
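The three phases can be expressed as an ordered plan. The sketch below is only illustrative: the function, action names, and the site/node data structures are assumptions for clarity, not PAM commands.

```python
def phased_patch_plan(primary_nodes, secondary_sites):
    """Return the ordered steps for the three-phase patch procedure.

    primary_nodes:   list of primary-site node names; the first two are
                     patched while the cluster is stopped.
    secondary_sites: list of lists, one list of node names per secondary
                     site.
    All names and actions here are illustrative, not PAM CLI commands.
    """
    plan = []
    # Phase 1: one node from each secondary site leaves the cluster and
    # is patched while the cluster keeps running -> no downtime.
    for site in secondary_sites:
        if site:
            plan.append(("phase1", "leave_cluster_and_patch", site[0]))
    # Phase 2: stop the cluster, patch two primary nodes, then form a
    # new cluster with the nodes already patched in Phase 1.
    plan.append(("phase2", "stop_cluster", None))
    for node in primary_nodes[:2]:
        plan.append(("phase2", "patch", node))
    plan.append(("phase2", "start_cluster_with_patched_nodes", None))
    # Phase 3: patch the remaining nodes and have them rejoin the
    # cluster one by one -> no downtime.
    remaining = primary_nodes[2:] + [
        n for site in secondary_sites for n in site[1:]
    ]
    for node in remaining:
        plan.append(("phase3", "patch_and_join", node))
    return plan
```

For a three-node primary site and two two-node secondary sites, this plan keeps the cluster stopped only while the two primary-site nodes are patched; every other node is handled while a cluster is serving requests.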
The first node was taken out of the cluster successfully. The UI timed out after 5 minutes, but the PAM admin could log back on without a problem afterwards, and the node looked ready for patching, so the admin proceeded.
However, when trying to make the next node leave the cluster, the following error was observed:
PAM-CMN-5201: Failed to leave the cluster. PAM-CMN-5200: The cluster configuration is being updated on xxx right now, please try again later.
PAM cluster with many nodes and a large database.
The Leave Cluster action may take more than 10 minutes on each node if the database is large; almost all of that time is spent creating a database backup.
The PAM UI does not wait that long: it reports a timeout error after 5 minutes. A PAM admin may wait a few more minutes, then log back on. Because the UI looks normal, the admin may conclude that the Leave Cluster action has finished and proceed with patching, which requires a reboot. The reboot kills the ongoing Leave Cluster workflow, and the remaining cluster nodes retain a "member_is_being_updated" marker file. That file prevents the next node from leaving the cluster, producing the errors "PAM-CMN-5201: Failed to leave the cluster. PAM-CMN-5200: The cluster configuration is being updated on xxx right now, please try again later."
You can avoid the problem by checking the Configuration > Clustering pages after logging back on. If the node still shows as a cluster member, the Leave Cluster action is not complete yet. Eventually the UI will become unavailable again for a short time while the tomcat service restarts; after that, the cluster details should be gone on this node.
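One way to automate that check is to wait for the node's UI to go through its up -> down -> up cycle (the tomcat restart) before patching. The sketch below is a heuristic based on the behavior described above, not a PAM API; the `ui_is_up` probe (e.g. an HTTPS reachability check against the node's UI) is an assumption the caller would supply.

```python
import time
from typing import Callable


def wait_for_leave_cluster(ui_is_up: Callable[[], bool],
                           poll_interval: float = 30.0,
                           sleep: Callable[[float], None] = time.sleep) -> None:
    """Wait for a node's Leave Cluster workflow to finish.

    Heuristic from the observed behavior: the node's UI stays up while
    the database backup runs, goes down briefly while tomcat restarts,
    and comes back up once the cluster details are gone. We therefore
    wait for an up -> down -> up transition. `ui_is_up` is a
    caller-supplied probe; this function never talks to PAM itself.
    """
    # UI still up: the backup is most likely still running.
    while ui_is_up():
        sleep(poll_interval)
    # UI down: tomcat is restarting.
    while not ui_is_up():
        sleep(poll_interval)
    # UI back up: log on and confirm on Configuration > Clustering that
    # the cluster details are gone before rebooting for the patch.
```

Even after this returns, confirm on the Configuration > Clustering page that the node is standalone before rebooting; the probe only detects the restart cycle, not the final cluster state.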
Note that the primary site nodes will remove this node from their cluster status page earlier in the process. Seeing the node missing from the list there does NOT guarantee that the Leave Cluster action is complete.
If you have already hit the problem and need to recover quickly, contact PAM Support. The offending marker files are expected to be cleaned up automatically within 24 hours, but PAM Support can remove them manually using SSH Debug access to the remaining cluster nodes.
As of September 2024 this problem could occur on any current PAM release, including 4.1.8 and 4.2. PAM Engineering is looking into changing the Leave Cluster action so that the UI does not time out before the action completes.