PAM node suddenly going into QuorunLossMode

Products

CA Privileged Access Manager (PAM)

Issue/Introduction

One of the PAM node suddenly shows "QuorumLossMode" on the console screen and becomes inaccessible.

runAsQuorumLossMode: all replication slave processes killed
runAsQuorumLossMode: begin to backup my own database
runAsQuorumLossMode: gtid_executed before backup:
    00000000-0000-0000-0000-000000000000:1-12345678,
\nXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXX:1-12349988,
\nYYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYY:1-12346654,
\nZZZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZZZZZZZ:1-12345646
runAsQuorumLossMode: completed backup of my own database
runAsQuorumLossMode: gtid_executed after backup:
    00000000-0000-0000-0000-000000000000:1-12345678,
\nXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXX:1-12349988,
\nYYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYY:1-12346654,
runAsQuorumLossMode: reset ProxySQL configuration so that we could restart it without GR
runAsQuorumLossMode: insert GTID into the configuration_f table
Inserting saved GTID as of quorum loss value 1234567 into MySQL.
runAsQuorumLossMode: set cluster status to OFF

Environment

Release : 3.3

Component : PRIVILEGED ACCESS MANAGEMENT

Cause

[There can be other network related conditions but this article is specifically addressing the "Too many connections" condition]

This happens when the PAM node believes it is unable to contact any of the PAM cluster members and one of the cause is the network interruption.

PAM should attempt to recover by trying to contact the replication members every 30 seconds for 5 minutes then will give up and enter this QuorumLossMode.

The event that caused this could have occurred many days ago if only specific PAM node is affected.

If several PAM nodes are demonstrating the problem then it could be due to immediate network interruption that may be on-going.

daemon.log will show details and following keywords will help to locate the time when PAM entered QuorumLossMode and its reason.

keyword1=getProxySQLOnlineHostgroup2 ("result = 0" would appear when PAM fails to communicate and "result = 1" would appear when it can communicate regardless of whether it reports ONLINE)

Jan 01 01:43:01 pamhostname1 GroupReplicationManager[21269]: getProxySQLOnlineHostgroup2 result= 0
Jan 01 01:43:01 pamhostname1 GroupReplicationManager[21269]: GR king status=ONLINE number of onlineHostgroup2=0
Jan 01 01:43:01 pamhostname1 GroupReplicationManager[21269]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0

You can also search for "onlineHostgroup2count=0" to filter problematic period.

Jan 01 01:43:01 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:43:01 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:43:31 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:44:01 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:44:31 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:45:01 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:45:32 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:46:02 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:46:32 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:47:02 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:47:32 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:48:02 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:48:32 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:49:02 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0
Jan 01 01:49:37 pamhostname1 GroupReplicationManager[12345]: Lost connection to GR king or no online hostgroup 2 found!!! status=ONLINE onlineHostgroup2count=0

keyword2=errMsg (This quickly brings you to the error events)

At 2020-01-01 01:43:01 it fails to connect to local(127.0.0.1) mysql(uag) due to "Too many connections"
Jan 01 01:43:01 pamhostname1 GroupReplicationManager[17826]: mysql_real_connect failed, errorNum = 1040, errMsg = "Too many connections", host=127.0.0.1, user=pam, password=********, db=uag
Jan 01 01:43:01 pamhostname1 GroupReplicationManager[17826]: terminate called after throwing an instance of 'Xceedium::DBException'
Jan 01 01:43:01 pamhostname1 systemd[1]: capam-group-replication-manager.service: Main process exited, code=killed, status=6/ABRT
Jan 01 01:43:01 pamhostname1 systemd[1]: capam-group-replication-manager.service: Unit entered failed state.
Jan 01 01:43:01 pamhostname1 systemd[1]: capam-group-replication-manager.service: Failed with result 'signal'.

keyword3=syslog (This is an official condition that is being reported to the syslog server so it is a significant event to look at)

Jan 01 01:49:37 pamhostname1 GroupReplicationManager[12345]: About to run following command to send syslog: perl -MXceedium::Syslogger -e 'Xceedium::Syslogger::logtosyslog(Xceedium::Syslogger::readsyslogconfig(), "PAM-CS-0015: CA PAM instance 192.168.0.10 lost its connection to other members of the primary site, and has limited functionality. Please check the availability of other primary site members to ensure that they are online and reachable. Next, check their status by visiting their respective URLs: https://192.168.0.9, https://192.168.0.8. If you cannot access other members, visit https://192.168.0.10 to repair the cluster.", "28")'

So the issue started at 01:43:01 and PAM gives up after 5 minutes and sends out syslog message that it is entering QuorumLossMode.

But this message may also have failed to be sent out if the condition was "Too many connections".

"Too many connections" is a result of existing problem and once the number of connections accumulated exceeds what the OS can handle then this PAM node will fail and become inaccessible.

The real issue could have happened days or months before this error message so the logs need to be analysed for network related issues that repeats and does not go away.

And following entry started many days ago and since then the number of connections seemed to have increased.

You may find different error condition, this is demonstrating one of the possible condition where jdbc connection is repeatedly being reported as closed.

Dec 11, 2019 09:12:34 PM com.cloakware.cspm.server.dao.impl.AnsiSQLGatekeeperDAO executeGatekeeperDatabaseQueryWithJsonResult
WARNING: AnsiSQLGatekeeperDAO.executeGatekeeperDatabaseQueryWithListResult Exception: 
java.sql.SQLException: Connection com.mysql.cj.jdbc.ConnectionImpl@5bea1833 is closed.
 at org.apache.commons.dbcp2.DelegatingConnection.checkOpen(DelegatingConnection.java:605)
 at org.apache.commons.dbcp2.DelegatingConnection.createStatement(DelegatingConnection.java:255)
 at org.apache.commons.dbcp2.DelegatingConnection.createStatement(DelegatingConnection.java:257)
 at com.cloakware.cspm.server.dao.impl.AnsiSQLGatekeeperDAO.executeGatekeeperDatabaseQueryWithJsonResult(AnsiSQLGatekeeperDAO.java:505)
 at com.cloakware.cspm.server.app.FileSynchronizationTimerTask.run(FileSynchronizationTimerTask.java:82)
 at java.util.TimerThread.mainLoop(Timer.java:555)
 at java.util.TimerThread.run(Timer.java:505)

And around this time there had been several reports of network outages in the aactrl log.

11/12/19 09:10:26 - Connection to 192.168.0.10 has been reestablished after an outage lasting 2 min(s) and 51 second(s). The gateway is currently reachable.

Resolution

Ideally software/application and the OS would be able to handle the network interruption but there are times when they cannot.

This is not PAM specific but software/application and OS in general.

In this specific condition, even a restart of the service does not drop all socket connections and a hard reboot is required.