We recently experienced a major outage in our production environment because the following connection limit was exceeded:
"Siteminder Connection request rejected. Connection limit of "x" exceeded."
How can we prevent this type of outage in the future?
An increase in load in the environment can require re-tuning the Policy Server so that it can handle the higher request load in a timely fashion.
Release : 12.8.0x
Component : SITEMINDER - POLICY SERVER
To prevent a build-up of Agent connections caused by increased load at the Policy Server, there are a number of factors to take into account and a number of tuning parameters to evaluate. The Policy Server must communicate with User Directories and/or User Databases to process requests. It uses the Worker Threads defined in the SMConsole (Maximum Threads) to process the IsProtected, IsAuthenticated, and IsAuthorized requests from the Agents.
Policy Server Side - Increase Worker Threads (LDAP Servers)
Worker Threads: 8
PriorityThreadCount: unknown (default 5) (SiteMinder Registry)
The MaxConnections setting should be high enough to accommodate the maximum number of concurrent Agent connections to the Policy Server at any given time. On Unix systems, ensure that the File Descriptor limit is higher than your MaxConnections limit.
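On Unix, one way to sanity-check this is to compare the open-file (descriptor) limit with the MaxConnections value. The sketch below is a minimal Python illustration, assuming you look up MaxConnections yourself (for example in the SMConsole) and pass it in. The `reserve` headroom for descriptors the Policy Server needs besides Agent sockets (LDAP handles, log files and so on) is an assumed figure, not a documented SiteMinder value.

```python
import resource  # Unix-only standard library module


def fd_headroom_ok(max_connections, soft_limit, reserve=256):
    """Return True if the descriptor limit exceeds MaxConnections plus a reserve.

    `reserve` is a hypothetical allowance for non-Agent descriptors
    (LDAP handles, log files, ...); tune it for your environment.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return soft_limit > max_connections + reserve


def current_soft_fd_limit():
    """Read the soft RLIMIT_NOFILE limit of the *current* process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    max_connections = 1024  # hypothetical MaxConnections value
    limit = current_soft_fd_limit()
    print(f"soft fd limit={limit}, ok={fd_headroom_ok(max_connections, limit)}")
```

Note that `current_soft_fd_limit()` reports the limit of the Python process, so run it under the same user and shell environment that starts the Policy Server (the equivalent shell check is `ulimit -n`).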
If a review of the back-end User Directories shows fast response times and you still see a build-up of Agent connections, this points to possibly overloaded Worker Threads, and the Worker Thread count may need to be increased.
An increase in Worker Threads may require a modification to the User Directory Definitions. The default install configures 8 Worker Threads, which is also a good rule of thumb for the maximum number of threads per User Directory Server in the initial Load-Balance group of your User Directory Definitions.
The rule of thumb when configuring the Policy Server is to define a separate User Directory Server in the User Directory Definition's initial "Load-Balance" group for every 8 Worker Threads. This provides enough LDAP handles for the configured Worker Threads and helps prevent a bottleneck in which any one Worker Thread has to wait to obtain a handle to the directory. For example, with 48 Worker Threads your User Directory Definitions should contain 6 User Directory Servers in their initial Load-Balance group to provide sufficient LDAP handles (48 / 8 = 6).
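The arithmetic behind the rule of thumb can be sketched as a small helper. Rounding up when the Worker Thread count is not a multiple of 8 is an assumption of this sketch (so a partial block of threads still gets its own server); the article itself only shows the exact case 48 / 8 = 6.

```python
THREADS_PER_DIRECTORY_SERVER = 8  # rule of thumb from this article


def servers_for_load_balance_group(worker_threads,
                                   threads_per_server=THREADS_PER_DIRECTORY_SERVER):
    """Number of User Directory Servers for the initial Load-Balance group."""
    if worker_threads <= 0:
        raise ValueError("worker_threads must be positive")
    # Integer ceiling division: round up to cover any partial block of threads.
    return (worker_threads + threads_per_server - 1) // threads_per_server


if __name__ == "__main__":
    print(servers_for_load_balance_group(48))  # 6
    print(servers_for_load_balance_group(50))  # 7
```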
Let's consider the following User Directory Definition configuration with 6 servers defined:
ldap:389 ldap:389 ldap:389 ldap:389 ldap:389 ldap:389
Note: commas are used for Failover and spaces are used for Load-Balancing.
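To make the comma/space convention concrete, here is a simplified reader for a server string. It is a sketch based only on the note above (commas split Failover groups, spaces split Load-Balanced servers within a group, the first group being the initial Load-Balance group); it is not SiteMinder's own parser, and the function names are hypothetical.

```python
def parse_directory_servers(definition):
    """Split a server string into Failover groups of Load-Balanced servers."""
    groups = []
    for group in definition.split(","):      # commas: Failover
        servers = group.split()              # spaces: Load-Balance
        if servers:
            groups.append(servers)
    return groups


def distinct_hosts(servers):
    """Host names in a group, without the :port suffix."""
    return {server.rsplit(":", 1)[0] for server in servers}


if __name__ == "__main__":
    groups = parse_directory_servers(
        "ldap:389 ldap:389 ldap:389 ldap:389 ldap:389 ldap:389")
    print(len(groups[0]), distinct_hosts(groups[0]))  # 6 {'ldap'}
```

Applied to the configuration above, the initial Load-Balance group has 6 entries but only one distinct host name, which matters for the dead-handle behaviour described below.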
Where there is only one physical server (ldap:389), 'Alias' entries in the "/etc/hosts" file or in DNS provide the additional server names needed to create the separate, "individual" LDAP handles that 48 Worker Threads require. You would need to create 5 'Alias' entries for this directory, which can then be defined in your User Directory Definitions for load-balancing:
<IP of ldap> ldap
<IP of ldap> ldap1
<IP of ldap> ldap2
<IP of ldap> ldap3
<IP of ldap> ldap4
<IP of ldap> ldap5
With the 'Alias' entries in "/etc/hosts" or DNS, you could then configure the User Directory Definition as follows:
ldap:389 ldap1:389 ldap2:389 ldap3:389 ldap4:389 ldap5:389
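If you need to do this for several directories, the hosts entries and the matching Load-Balance string can be generated together. This is a minimal sketch; the naming scheme (`ldap`, `ldap1`, ... `ldap5`) follows the example above, and the IP address 192.0.2.10 is a hypothetical documentation address standing in for `<IP of ldap>`.

```python
def alias_names(base, count):
    """Return `count` host names: the base name, then base1, base2, ..."""
    return [base] + [f"{base}{i}" for i in range(1, count)]


def hosts_entries(ip, names):
    """Lines for /etc/hosts that map every alias to the same physical server."""
    return [f"{ip} {name}" for name in names]


def load_balance_definition(names, port=389):
    """Space-separated (Load-Balance) server string for the Directory Definition."""
    return " ".join(f"{name}:{port}" for name in names)


if __name__ == "__main__":
    names = alias_names("ldap", 6)
    print("\n".join(hosts_entries("192.0.2.10", names)))
    print(load_balance_definition(names))
```

The last line printed is `ldap:389 ldap1:389 ldap2:389 ldap3:389 ldap4:389 ldap5:389`, matching the definition above.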
When SiteMinder encounters an error on a User Directory connection and the re-bind fails, the Policy Server marks that directory as "BAD" and adds it to the "DeadHandleList". When processing the "DeadHandleList", the Policy Server tears down ALL connections to the named server. In the first configuration above, ALL servers are defined in the Directory Definition as "ldap", so an issue on a single connection would cause the Policy Server to tear down every connection available to the 48 Worker Threads, and all of them would have to wait until the connections are re-established.
This is why 'Alias' entries are used: the Policy Server treats these connections as separate servers, "ldap" through "ldap5". To the Policy Server these are 6 separate User Directory Servers, so a failed connection on server "ldap5" does not affect the connections to the other 5 servers, and the Worker Threads still have LDAP handles to use while the "ldap5" connections are re-established. Once your initial Load-Balance group is configured, you can then work on the Failover strategy for your User Directory Definitions.
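The effect of tearing down connections by server name can be shown with a deliberately simplified model. It assumes each server entry holds 8 handles (one entry per 8 Worker Threads, as in the 48-thread example) and that a dead-list entry removes every handle whose host name matches; it illustrates the behaviour described above and is not the Policy Server's implementation.

```python
from collections import Counter

HANDLES_PER_SERVER_ENTRY = 8  # assumption: one entry per 8 Worker Threads


def handles_lost(server_entries, failed_host,
                 handles_per_entry=HANDLES_PER_SERVER_ENTRY):
    """Count LDAP handles torn down when `failed_host` goes on the dead list.

    Every entry whose host name matches the failed server loses its handles,
    regardless of its position in the Load-Balance group.
    """
    hosts = Counter(entry.rsplit(":", 1)[0] for entry in server_entries)
    return hosts[failed_host] * handles_per_entry  # Counter returns 0 if absent


if __name__ == "__main__":
    same_name = ["ldap:389"] * 6
    aliased = ["ldap:389"] + [f"ldap{i}:389" for i in range(1, 6)]
    print(handles_lost(same_name, "ldap"))  # 48: every handle is lost
    print(handles_lost(aliased, "ldap5"))   # 8: only the ldap5 handles
```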
Depending on the type of connections that are building up, you may also need to increase the "High Priority Thread Pool", which defaults to 5 out of the box. The High Priority Thread Pool services the Agent connection requests; once a connection is established, the request is placed in the Normal Priority Queue for processing. The High Priority Thread count is controlled by the SiteMinder Registry key 'PriorityThreadCount':
Defines the minimum number of High Priority threads. By default the registry key does not exist, and the value is 5. To change the value, add the registry key and configure a value.
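When reviewing a configuration, it helps to remember that an absent key still means 5 threads. The sketch below resolves the effective value from a hypothetical mapping of registry value names to values (for example, collected by hand during a review); it does not read the SiteMinder registry itself, and the lower-bound check is a sanity check added for this sketch, not a documented constraint.

```python
DEFAULT_PRIORITY_THREAD_COUNT = 5  # value used when the registry key does not exist


def effective_priority_thread_count(registry_values):
    """Return the High Priority thread minimum implied by the registry values."""
    raw = registry_values.get("PriorityThreadCount")
    if raw is None:
        return DEFAULT_PRIORITY_THREAD_COUNT  # key absent: default applies
    value = int(raw)
    if value < 1:  # sketch-only sanity check
        raise ValueError("PriorityThreadCount should be a positive integer")
    return value


if __name__ == "__main__":
    print(effective_priority_thread_count({}))                           # 5
    print(effective_priority_thread_count({"PriorityThreadCount": "10"}))  # 10
```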
On the Agent side, you can use the Agent's caches (MaxResourceCacheSize and MaxSessionCacheSize):
MaxResourceCacheSize: Specifies the maximum number of entries that the Web Agent keeps in its resource cache. An entry contains the following information:
- A Policy Server response about whether a resource is protected
- Any additional attributes returned with the response
When the maximum is reached, new resource records replace the least recently used resource records.
See Set the Maximum Resource Cache Size.
MaxSessionCacheSize: Specifies the maximum number of users the Agent maintains in its session cache.
See Set the Maximum User Session Cache Size.
By storing the Policy Server's "IsProtected" decisions for URLs locally, the Agent can answer subsequent IsProtected requests for those URLs itself instead of sending them to the Policy Server, which helps reduce the load on the Policy Server.
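The resource-cache behaviour described above (a bounded cache of IsProtected decisions with least-recently-used replacement) can be modelled in a few lines. This is an illustrative sketch, not the Web Agent's implementation; `ask_policy_server` is a hypothetical callback standing in for the round trip to the Policy Server, and `max_size` plays the role of MaxResourceCacheSize.

```python
from collections import OrderedDict


class ResourceCache:
    """Minimal LRU cache of IsProtected decisions, keyed by URL."""

    def __init__(self, max_size, ask_policy_server):
        self.max_size = max_size                  # stands in for MaxResourceCacheSize
        self.ask_policy_server = ask_policy_server
        self.entries = OrderedDict()              # oldest entry first
        self.policy_server_calls = 0

    def is_protected(self, url):
        if url in self.entries:
            self.entries.move_to_end(url)         # cache hit: mark most recently used
            return self.entries[url]
        self.policy_server_calls += 1             # cache miss: ask the Policy Server
        decision = self.ask_policy_server(url)
        self.entries[url] = decision
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)      # evict least recently used entry
        return decision


if __name__ == "__main__":
    cache = ResourceCache(2, lambda url: url.startswith("/secure"))
    for url in ["/secure/a", "/public", "/secure/a", "/secure/b", "/public"]:
        cache.is_protected(url)
    print(cache.policy_server_calls)  # 4 calls for 5 lookups
```

With a cache of only 2 entries, the repeated "/secure/a" lookup is served locally, while "/public" is evicted and has to be fetched again; a larger MaxResourceCacheSize would avoid that second trip.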
SmHost.conf and HCOs
Consider spreading the "BootStrap" connections across the available Policy Servers by modifying the SmHost.conf files, so that not all Agents contact the same Policy Server at startup.
Consider using multiple HCOs with different Policy Servers and different load orders for those Policy Servers, so that not all Agents try to contact the same first Policy Server in their list.
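One simple way to plan different load orders is to rotate the Policy Server list for each HCO, so that each HCO lists a different server first. The sketch below uses plain list rotation (a round-robin choice made for this example, not a SiteMinder feature); the host names and HCO labels are hypothetical, and the resulting orders would still be entered in each HCO by hand.

```python
def rotated_orders(policy_servers, hco_count):
    """Give each HCO a rotated copy of the Policy Server list."""
    n = len(policy_servers)
    if n == 0:
        raise ValueError("need at least one Policy Server")
    return [policy_servers[i % n:] + policy_servers[:i % n] for i in range(hco_count)]


if __name__ == "__main__":
    servers = ["ps1.example.com", "ps2.example.com", "ps3.example.com"]
    for i, order in enumerate(rotated_orders(servers, 3), start=1):
        print(f"HCO-{i}: {', '.join(order)}")
```

Agents assigned to HCO-1, HCO-2, and HCO-3 would then start with ps1, ps2, and ps3 respectively, while each still lists every server for fallback.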
For additional information, please refer to the following sections from the R126.96.36.199 Documentation - Implementation Guide: