CA Single Sign On Secure Proxy Server (SiteMinder), AXIOMATICS POLICY SERVER, CA Single Sign On SOA Security Manager (SiteMinder), CA Single Sign-On
We have identified unexpected behavior in a CA SSO 12.7 bi-cluster setup that leads to a downtime condition. We perceive it as a deviation from the intended functionality and would like it to be resolved with a patch for the current or the closest future release. Here is the scenario.
The Setup: The CA SSO PS infrastructure is set up in 2 clusters with 3 nodes in each (1.1, 1.2, 1.3 and 2.1, 2.2, 2.3 respectively) and a failover threshold of 50%. The "enable failover" feature between cluster nodes is turned off.
The bug: In a failover scenario we managed to reach a reproducible state where the entire agent infrastructure was down while 2 out of 6 Policy Servers were still up but idling. We reached it by executing the following failover test scenario:
1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually fail over from Cluster 1, as expected, because its availability drops to 33% and Cluster 2 becomes preferred.
2. Node 2.2 is shut down. Result: All agents stay on Cluster 2 because it is still at 66%.
3. Node 2.3 is shut down. Result: All agents are down and disconnected from both Cluster 1 and Cluster 2, which still have 33% capacity each.
This is obviously a problem: 2 servers are still up but do nothing, while the entire agent environment in both data centers is down. The only way to work around the bug is to NOT use the failover threshold at all, i.e. to set it to 33% so that agents keep hammering the poor Cluster 1 until it faints, all the while Cluster 2 enjoys its 100% capacity. This has to be addressed.
Here is a sample to illustrate it:
We have 6 Policy Servers configured in 2 clusters as follows:
Cluster A: 1.1, 1.2, 1.3
Cluster B: 2.1, 2.2, 2.3
The failover threshold is set to 50%, which means that a cluster is considered down when at least 50% of the Policy Servers in that cluster are unavailable. The following steps are then performed:
1. Nodes 1.1 and 1.2 are shut down. Result: All agents gradually fail over from Cluster A, as expected, because its availability drops to 33% and Cluster B becomes active.
2. Node 2.2 is shut down. Result: All agents stay on Cluster B because it is still at 66%.
3. Node 2.3 is shut down. Result: All agents are down and disconnected from both Cluster A and Cluster B, which still have 33% capacity each.
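The arithmetic behind these three steps can be sketched in a few lines of Python. This is only a simplified model, not the product's actual agent logic: it assumes a cluster counts as usable when the percentage of its Policy Servers that are up is at or above the failover threshold, and the names `Cluster`, `shut_down` and `usable_clusters` are hypothetical helpers introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    nodes: dict  # node name -> True if the Policy Server is up

    def availability_pct(self) -> float:
        """Percentage of nodes in this cluster that are currently up."""
        up = sum(1 for is_up in self.nodes.values() if is_up)
        return 100.0 * up / len(self.nodes)


def shut_down(cluster: Cluster, *node_names: str) -> None:
    """Mark the given nodes as down."""
    for node_name in node_names:
        cluster.nodes[node_name] = False


def usable_clusters(clusters, threshold_pct: float) -> list:
    """Assumed rule: a cluster is usable while availability >= threshold."""
    return [c.name for c in clusters if c.availability_pct() >= threshold_pct]


THRESHOLD_PCT = 50.0
cluster_a = Cluster("A", {"1.1": True, "1.2": True, "1.3": True})
cluster_b = Cluster("B", {"2.1": True, "2.2": True, "2.3": True})
clusters = [cluster_a, cluster_b]

# Replay the three steps and record which clusters remain usable after each.
history = []
for cluster, node_names in [
    (cluster_a, ("1.1", "1.2")),  # step 1
    (cluster_b, ("2.2",)),        # step 2
    (cluster_b, ("2.3",)),        # step 3
]:
    shut_down(cluster, *node_names)
    history.append(usable_clusters(clusters, THRESHOLD_PCT))

for step, usable in enumerate(history, start=1):
    print(f"After step {step}: usable clusters = {usable}")
```

Under this model, Cluster B is the only usable cluster after steps 1 and 2, and no cluster meets the 50% threshold after step 3, even though one node in each cluster is still up.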
We expect (as the documentation states) requests to still go to the available nodes in Cluster B.
Release: MSPSSO99000-12.8-Single Sign-On-for Business Users-MSP
Component:
The behavior observed is expected and working as designed. To prevent a cluster from being considered down while 1 Policy Server is still up, the failover threshold should be set to less than 30%. Another option is to set up a third cluster containing all the Policy Servers, so that the two available nodes would be there.
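As a rough check of the threshold recommendation, the following sketch compares the availability of a 3-node cluster with 1 node up against two threshold values. It assumes, as a simplification, that a cluster stays usable while its availability percentage is at or above the threshold; how the product treats the exact boundary is not covered here, and the function names are hypothetical.

```python
def availability_pct(nodes_up: int, nodes_total: int) -> float:
    """Percentage of Policy Servers in a cluster that are up."""
    return 100.0 * nodes_up / nodes_total


def stays_usable(nodes_up: int, nodes_total: int, threshold_pct: float) -> bool:
    """Assumed rule: the cluster is usable while availability >= threshold."""
    return availability_pct(nodes_up, nodes_total) >= threshold_pct


# One Policy Server left out of three, checked against both thresholds.
one_of_three = availability_pct(1, 3)
results = {threshold: stays_usable(1, 3, threshold) for threshold in (50, 30)}

print(f"Availability with 1 of 3 nodes up: {one_of_three:.1f}%")
for threshold, usable in results.items():
    print(f"Threshold {threshold}%: cluster usable = {usable}")
```

With 1 of 3 nodes up the availability is about 33%, which is below a 50% threshold but above a 30% one.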