We just had an issue with the PAM primary node: we received several alerts that the appliances were losing connection. It lasted 15 to 20 minutes, and several users were kicked out of their current sessions. Our secondaries did not go down, but it looks like users were not failed over properly. I am attaching the logs.bin files for all the primary appliances.
A PAM appliance can become inaccessible if CPU utilization goes too high. There are several possible reasons for this, but in this case the NFS recording share had been taken offline while several recorded SSH sessions were still in progress. This led to CPU utilization above 90%, which starved the system of resources and even caused cluster services to stop responding. See internal reference for more details on how to validate this cause.
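As a rough illustration of the 90% threshold mentioned above, the following sketch computes CPU utilization from two samples of the aggregate `cpu` line in `/proc/stat` on a generic Linux host. It is not a PAM tool and is not taken from the appliance. The function names and the default threshold parameter are assumptions made for this example, and the appliance itself may not allow running scripts.

```python
import time


def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted in user time, so it is not added again.
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return idle, sum(values)


def cpu_utilization(before, after):
    """Percentage of non-idle time between two 'cpu' line samples."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / total_delta)


def is_overloaded(before, after, threshold=90.0):
    """True if utilization between the samples exceeds the (assumed) threshold."""
    return cpu_utilization(before, after) > threshold


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


def sample_live(interval=1.0):
    """Take two live samples `interval` seconds apart (Linux only)."""
    first = read_cpu_line()
    time.sleep(interval)
    return cpu_utilization(first, read_cpu_line())
```

On a Linux host, `sample_live()` returns the utilization over a one-second window. Counting iowait as idle time is a design choice here; a share that has gone offline can also show up as high iowait, so that field may be worth reporting separately.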
Release : 3.4
Component : PRIVILEGED ACCESS MANAGEMENT
Restoring the NFS mount service resolved this issue. In other cases, additional actions may be required, such as a support engineer logging in through SSH to clean up manually, or recycling the node.
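To confirm that an NFS share is reachable again after it has been restored, a check along the following lines can be run on a Linux host that mounts the same share. This is a minimal sketch and not part of PAM. The function names, the timeout value, and the example mount path are hypothetical.

```python
import os
import threading


def nfs_mount_points(mounts_text):
    """Extract mount points of NFS filesystems from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points


def mount_responds(path, timeout=5.0):
    """Return True if os.stat(path) succeeds within `timeout` seconds."""
    result = {}

    def probe():
        try:
            os.stat(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # A daemon thread is used so that a blocked stat call does not keep
    # the caller waiting beyond the timeout.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    # No result after the timeout means the stat call is still blocked,
    # which is a common symptom of an unreachable hard-mounted NFS share.
    return result.get("ok", False)


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for point in nfs_mount_points(f.read()):
            state = "ok" if mount_responds(point) else "NOT RESPONDING"
            print(f"{point}: {state}")
```

The probe runs in a separate thread because a stat call against an unreachable hard mount can block indefinitely. Mount points containing spaces are escaped in `/proc/mounts` and are not decoded by this sketch.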