How does load balancing work in a managed Symantec Endpoint Protection (SEP) environment?
Load balancing the SEP clients in a managed environment is a process that maintains a reasonable, dynamic distribution of SEP clients among Symantec Endpoint Protection Managers (SEPMs). Because the process is completely random, it does not imply that SEP clients should or will be distributed equally among all existing management servers.
SEPM and client roles in load balancing
The SEPM creates, maintains, and provides the Management Server List (MSL) to a SEP client through the Sylink.xml file. No other processes are involved, and the SEPM does not interact with the load balancing process in any other way. The MSL is divided into priority levels by defining priority groups of SEPMs, such as Priority 1, Priority 2, and so on.
SEP clients are responsible for implementing the load balancing process independently of other SEP clients, using the MSL of SEPMs stored in the Sylink.xml file. The 24-hour clock is reset if the client is rebooted, the SEP client services are reset, or the client is unable to connect to the last SEPM it connected to.
Overview of the process the SEP client uses to select a SEPM
Client SEPM selection is based on the client randomly selecting a SEPM server from the MSL stored in the client's Sylink.xml file.
The algorithm used by the SEP client to randomly select a server from the MSL works as follows:
"Choose the index of rand() % nServerCountInCurrentPriority"
For more information about the C++ function rand(), see http://www.cplusplus.com/reference/cstdlib/rand/
The rand() function can return the same result repeatedly, so there is no limit on how many times the same selection can be made consecutively. Each SEP client performs this random selection independently of every other SEP client and is limited in its choice only by the number of SEPM servers listed in the MSL and the priority groups that are defined. The SEP client has no awareness of any other SEP client's decisions, nor does it know how many clients each SEPM has. The process is completely random, and at any given time one or more SEPMs may have more clients than the other SEPMs in the same MSL priority group. Statistically, the chance that every SEP client will choose the same SEPM is very low. However, because each SEP client is free to randomly select any SEPM from the MSL, 10 out of 10 clients could all randomly select the same SEPM.
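The uneven spread described above can be illustrated with a small simulation. This is a minimal sketch, not the SEP client's code: the function names are hypothetical, Python's random module stands in for the C++ rand() call, and a single shared generator is used for all simulated clients, whereas real clients each choose independently.

```python
import random
from collections import Counter


def pick_server_index(server_count, rng):
    """Model of 'rand() % nServerCountInCurrentPriority'."""
    # rng.randrange(32768) stands in for rand(); the 0..32767 range is an
    # assumption (RAND_MAX is implementation-defined, but at least 32767).
    return rng.randrange(32768) % server_count


def simulate(client_count, server_count, seed=None):
    """Return how many simulated clients chose each server index."""
    rng = random.Random(seed)
    return Counter(
        pick_server_index(server_count, rng) for _ in range(client_count)
    )


if __name__ == "__main__":
    counts = simulate(client_count=1000, server_count=3, seed=1)
    for index in range(3):
        print(f"SEPM {index}: {counts[index]} clients")
```

Running the simulation with different seeds shows that the per-server counts vary from run to run and are rarely equal, which matches the behavior described above.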
Step by step process of SEPM selection from MSL
1. Start with the "MSL Priority 1" list, regardless of what SEPM server the client is currently connected to.
2. Randomly select a SEPM server in the "MSL Priority 1" list and attempt to connect.
3. If the SEPM server does not respond or is down, randomly select another SEPM server within the "MSL Priority 1" list.
4. Repeat the above process until a SEPM server responds or until all SEPM servers contained in the "MSL Priority 1" list have been tried and failed to respond.
5. Move to the next lower "MSL Priority X" list and perform the same process until a SEPM responds or the list has been exhausted.
6. If there are additional "MSL Priority X" levels, continue attempting connections through these consecutive levels until all levels and all listed management servers have been exhausted.
7. Return to step (1) and begin the entire process again.
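The steps above can be sketched as a single pass through the MSL. This is an illustrative model only: select_sepm and try_connect are hypothetical names, and the sketch assumes a server that failed to respond is not retried within the same pass, which the steps imply but do not state explicitly.

```python
import random


def select_sepm(priority_groups, try_connect, rng=None):
    """Make one pass through the MSL (steps 1 to 6).

    priority_groups: list of lists of server names, Priority 1 first.
    try_connect: callable returning True if the server responds.
    Returns the first responding server, or None if every server failed,
    in which case the caller starts over (step 7).
    """
    if rng is None:
        rng = random.Random()
    for group in priority_groups:
        untried = list(group)
        while untried:
            # Randomly pick one of the servers not yet tried in this group.
            candidate = untried.pop(rng.randrange(len(untried)))
            if try_connect(candidate):
                return candidate
    return None


if __name__ == "__main__":
    # Hypothetical MSL: two Priority 1 servers (both down), one Priority 2.
    msl = [["sepm-a", "sepm-b"], ["sepm-dr"]]
    responding = {"sepm-dr"}
    print(select_sepm(msl, lambda server: server in responding))
```

In this example both Priority 1 servers are down, so the pass falls through to the Priority 2 group and returns sepm-dr.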
Factors that can influence a SEPM selection - SEP client "UseLastServer" registry value
By default, the SEP client keeps track of the last SEPM it connected to and will always attempt to reconnect to that SEPM prior to accessing the MSL contained in the Sylink.xml. If you would like to make a client choose a SEPM randomly, you need to add/modify the following registry value:
HKEY_LOCAL_MACHINE\SOFTWARE\Symantec\Symantec Endpoint Protection\SMC\SYLINK\SyLink\UseLastServer = 0
("HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\..." on 64-bit machines.)
Notes: If this value does not exist or is not 0, the SEP client will use the last server stored in the "LastServer" value located in the same SyLink registry key. UseLastServer is a DWORD value, and LastServer is REG_SZ.
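The rule in the notes above can be expressed as a small decision function. This is a sketch of the documented rule only: first_connection_target is a hypothetical name, and the registry values are passed in as a plain dictionary rather than read from the registry.

```python
def first_connection_target(sylink_values):
    """Return the server to try before consulting the MSL, or None.

    sylink_values: dict modelling the SyLink registry key, for example
    {"UseLastServer": 0, "LastServer": "sepm1"} (hypothetical data).
    """
    if sylink_values.get("UseLastServer") == 0:
        # UseLastServer explicitly set to 0: go straight to the MSL.
        return None
    # Value missing or non-zero: reuse LastServer if one is recorded.
    return sylink_values.get("LastServer") or None


if __name__ == "__main__":
    print(first_connection_target({"LastServer": "sepm1"}))
    print(first_connection_target({"UseLastServer": 0, "LastServer": "sepm1"}))
```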
To understand how long a SEP client will attempt to connect to a SEPM before making another selection, it helps to first understand how a SEP client maintains a connection to a SEPM. In any event, if the selected SEPM does not respond at all, it is assumed to be "down" and the selection process continues. Several factors combine to determine whether a SEPM is down and whether the client makes another SEPM selection.
SEP environment loads
How long the SEPM console will take to show a SEP client offline after it disconnects is managed as follows:
A thread in Secars maintains a data structure that contains the SEP clients' online/offline information. This thread runs and checks the SEP clients' online status every 30 seconds. If it detects a client that has not connected for 2 heartbeats, it changes that client's data to "client offline" status.
Additionally, on the Java side, another thread (in Tomcat) periodically retrieves the data structure from Secars and maintains its own copy. It obtains the data from Secars every 20 seconds. For example, when a SEPM console user first goes to the Clients tab or clicks the refresh link on the Clients tab, the SEPM console retrieves this data structure from Tomcat and shows the SEP client status.
Based on the above information, the maximum time between a SEP client disconnecting and the SEPM console displaying it as offline is (2 * Heartbeat + 20 seconds + 30 seconds). This means the SEP client could be searching for, or already connected to, another SEPM while it still shows as connected in the console of the SEPM it started with. Because of this delay and the dynamic nature of the client, any snapshot in time of the distribution of SEP clients across SEPMs cannot be treated as a definitive representation.
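As a worked example of the formula above, the following sketch computes the maximum display delay for a given heartbeat interval. The function and constant names are hypothetical, and the 300-second heartbeat is only an example value; use the heartbeat configured in your environment.

```python
SECARS_CHECK_SECONDS = 30   # Secars thread checks client status every 30 s
TOMCAT_POLL_SECONDS = 20    # Tomcat thread pulls data from Secars every 20 s


def max_offline_display_delay(heartbeat_seconds):
    """Maximum seconds before the console can show a client as offline."""
    return 2 * heartbeat_seconds + TOMCAT_POLL_SECONDS + SECARS_CHECK_SECONDS


if __name__ == "__main__":
    # Example heartbeat of 300 seconds (5 minutes).
    print(max_offline_display_delay(300))  # prints 650
```

With a 300-second heartbeat, the console could show a disconnected client as online for up to 650 seconds, or 10 minutes and 50 seconds.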
In addition, there is a Sylink backoff algorithm:
If there is an error, or the client is unable to connect to any server, the Sylink will "back off". The first time there is an issue, the Sylink backs off for 32 seconds. For every consecutive connection failure, the backoff time increases exponentially, up to a maximum of 2048 seconds (34 minutes and 8 seconds). Once the backoff reaches the maximum, the Sylink continues to back off in intervals of 2048 seconds until a connection is made. Once a connection is made, the backoff counter is reset.
Several networking issues can affect SEP client distribution in any environment. Network design, slow DNS resolution, and limited-bandwidth locations can affect TCP connection times and cause dropped packets, preventing successful and consistent communication between SEP clients and SEPMs.
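The Sylink backoff schedule described earlier in this section can be sketched as a generator. This is an illustration, not the Sylink implementation: the function name is hypothetical, and doubling on each failure is an assumption consistent with "exponentially" and with the 32-second and 2048-second bounds.

```python
from itertools import islice


def sylink_backoff_delays(initial=32, maximum=2048):
    """Yield the wait time in seconds after each consecutive failure."""
    delay = initial
    while True:
        yield delay
        # Assumed growth rule: double each time, capped at the maximum.
        delay = min(delay * 2, maximum)


if __name__ == "__main__":
    print(list(islice(sylink_backoff_delays(), 9)))
    # prints [32, 64, 128, 256, 512, 1024, 2048, 2048, 2048]
```

Under this assumption the maximum is reached on the seventh consecutive failure. A successful connection would correspond to creating a fresh generator, which resets the counter.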
Troubleshooting load balancing
If the client does not have an adequate established network connection, this is deemed a failure to connect and results in an HTTP error-level response, most likely in the 500 series. The SEP client uses HTTP/HTTPS to connect to the SEPM. A 400- or 500-level response in the Internet Information Services (IIS) logs indicates an issue and could signal a failed heartbeat to the SEPM. A response of 200 is the desired result.
Tomcat communication depends on the TCP and HTTP(S) protocols, the configured network devices, and the operating system configuration. If the SEP client receives a 400- or 500-level error indicating a communication issue after 2 heartbeats, including the backoff algorithm, the SEPM will be viewed as "down".
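When reviewing status codes collected from the web server logs, a small helper can group them the way this section describes. This is a sketch with hypothetical function names; it assumes the status codes have already been extracted from the logs as integers and does not parse any log format.

```python
from collections import Counter


def classify_status(status_code):
    """Group an HTTP status code as described in this section."""
    if status_code == 200:
        return "ok"        # the desired result
    if 400 <= status_code < 600:
        return "error"     # 400- or 500-level: possible failed heartbeat
    return "other"         # anything else is left for manual review


def summarize(status_codes):
    """Count how many responses fall into each group."""
    return Counter(classify_status(code) for code in status_codes)


if __name__ == "__main__":
    print(summarize([200, 200, 404, 500, 302]))
```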
Netstat (or equivalent)
If the SEP clients are configured in "Push" mode, Netstat can be used to verify how many clients are connected to each SEPM. This requires you to connect to each SEPM server and run Netstat on all SEPMs within a short period of time. You would then review the Netstat results to see how closely they match the number of clients each SEPM reports that it manages.
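Counting connections by hand is error-prone, so the Netstat output can be saved to a file and summarized. The sketch below is one possible approach under stated assumptions: it expects the column layout of Windows "netstat -an" output (Proto, Local Address, Foreign Address, State) with IPv4 addresses, the function name is hypothetical, and port 8014 appears only as an example; substitute the port your SEPM uses for client communication.

```python
def count_established_clients(netstat_output, local_port):
    """Count distinct remote hosts with an ESTABLISHED TCP connection
    to the given local port."""
    remote_hosts = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue  # skip headers, blank lines, and UDP entries
        local, remote, state = fields[1], fields[2], fields[3]
        if state != "ESTABLISHED":
            continue
        if local.rsplit(":", 1)[-1] != str(local_port):
            continue
        # Keep only the address part so each client is counted once.
        remote_hosts.add(remote.rsplit(":", 1)[0])
    return len(remote_hosts)


if __name__ == "__main__":
    sample = """
  Proto  Local Address          Foreign Address        State
  TCP    10.0.0.5:8014          10.0.1.20:51234        ESTABLISHED
  TCP    10.0.0.5:8014          10.0.1.21:51300        ESTABLISHED
  TCP    10.0.0.5:3389          10.0.9.9:60000         ESTABLISHED
"""
    print(count_established_clients(sample, 8014))  # prints 2
```

Running the same function against the output captured from each SEPM gives per-server counts that can be compared with the client counts shown in the SEPM reports.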
Using a Proxy
Maintaining a proxy server within your networking environment to manage connection traffic can provide a log of when and how each client connected. This way you can review the connection history without having to watch everything live. You can even have your proxy system log connection data into a database. This method allows you to summarize and graph the connection data, and could even allow you to "follow" a client to see its connection pattern(s).
Note: Installation, configuration and use of a proxy server is beyond the scope of this document.