Node inaccessible after removing from cluster

Article ID: 226710

Products

CA Privileged Access Manager (PAM)

Issue/Introduction

While updating security certificates, we remove one node at a time from the cluster in the primary site using the LEAVE CLUSTER button on the Configuration > Clustering page. This lets us verify and apply a new server certificate, reboot the node, and add it back to the cluster. This worked for the first node, but for the second node the UI reported an error after about 5 minutes, and we have not been able to access the UI via browser or the PAM client since.

Environment

Release: 3.4.0 through 3.4.4, and 4.0.0

Component: PRIVILEGED ACCESS MANAGEMENT

Cause

The cause is a timing problem in the process of leaving the cluster, which involves taking a database backup on the node that is leaving. The backup could hang if it started before database replication to this node had stopped and new replication activity occurred during that time window.

Resolution

This problem is fixed in releases 3.4.5 and later, and 4.0.1 and later.
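As a quick way to check whether a node's release falls in the affected range before starting maintenance, the version ranges listed in this article can be expressed as a small helper. This is only a minimal sketch: the function names are hypothetical, it is not a PAM tool or API, and it assumes the release is a plain dotted string of which only the first three components matter.

```python
def parse_version(text):
    """Turn a dotted release string such as '3.4.4' into a 3-part tuple of ints.

    Assumption: only major.minor.patch is compared; missing parts count as 0.
    """
    parts = [int(p) for p in text.strip().split(".")[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def is_affected(release):
    """Return True if the release is in the affected ranges from this article:
    3.4.0 through 3.4.4, or exactly 4.0.0."""
    version = parse_version(release)
    in_3_4_range = (3, 4, 0) <= version <= (3, 4, 4)
    is_4_0_0 = version == (4, 0, 0)
    return in_3_4_range or is_4_0_0


if __name__ == "__main__":
    for release in ("3.4.4", "3.4.5", "4.0.0", "4.0.1"):
        status = "affected" if is_affected(release) else "not affected"
        print(release, status)
```

A result of "not affected" only reflects the ranges stated above; it says nothing about releases this article does not mention.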

If you experience this problem and have SSH debug access enabled, open a case with PAM Support, who can use SSH access to kill the hung "mysqlbackup" process. If SSH access is not enabled and you also cannot log in as the config user at https://<pamserver>/config/?legacy=1, a hard reboot of the appliance is needed. This is undesirable because the node would be rebooted while it is still in the middle of leaving the cluster.

In general, we recommend always enabling SSH debug access before making changes on PAM nodes that involve a reboot or a change in cluster configuration. SSH debug patches are valid for 90 days from the time they were created. If needed, open a case with PAM Support to get the latest debug patch before scheduled maintenance.
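When planning a maintenance window, the 90-day validity can be checked with simple date arithmetic. The sketch below is illustrative only: the function names are hypothetical, the patch creation date must come from your own records, and it assumes conservatively that the patch is no longer usable on the 90th day after creation.

```python
from datetime import date, timedelta

# Validity period stated in this article.
PATCH_VALIDITY = timedelta(days=90)


def patch_expiry(created_on):
    """Return the first date on which the patch is treated as expired."""
    return created_on + PATCH_VALIDITY


def patch_valid_on(created_on, maintenance_day):
    """Return True if the maintenance day falls inside the validity window.

    Assumption: the window is [created_on, created_on + 90 days), so the
    expiry date itself is treated as invalid.
    """
    return created_on <= maintenance_day < patch_expiry(created_on)


if __name__ == "__main__":
    created = date(2024, 1, 1)       # example creation date
    maintenance = date(2024, 3, 15)  # example maintenance date
    print("expires:", patch_expiry(created))
    print("valid for maintenance:", patch_valid_on(created, maintenance))
```

If the check returns False, request a new debug patch from PAM Support before the maintenance date.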