Endpoint Server running slowly, Aggregator logs show 'Rejecting existing connection with AgentId...'

Article ID: 189528


Products

Data Loss Prevention Endpoint Suite

Issue/Introduction

You experience slow performance when interacting with a Windows Endpoint Server that has many connected agents and/or processes a high volume of Two-Tier Detection (TTD) requests.

Symptoms include:

  • High disk utilization
  • High CPU utilization
  • Lag when interacting with the Windows Explorer shell


In the Server Aggregator log you observe a high volume of the following warnings:

Class: com.symantec.dlp.communications.aclayer.impl.ApplicationConnectionsManager
Method: logRejectionAndInitiateDisconnect
Level: WARNING
Message:  Rejecting existing connection with AgentId '<AgentName>' since a new connection with the same AgentId has come in.
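To gauge how widespread the problem is, you can tally these warnings per agent. A minimal sketch in Python (the regular expression is based on the message format shown above; the function name and sample lines are illustrative, not part of the product):

```python
import re
from collections import Counter

# Matches the Aggregator warning shown above; the quoted value is the AgentId.
REJECTION = re.compile(r"Rejecting existing connection with AgentId '([^']+)'")

def count_rejections(log_lines):
    """Tally 'Rejecting existing connection' warnings per AgentId."""
    counts = Counter()
    for line in log_lines:
        match = REJECTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Synthetic sample lines in the format quoted above:
sample = [
    "Message:  Rejecting existing connection with AgentId 'HOST-A' since a new connection with the same AgentId has come in.",
    "Message:  Rejecting existing connection with AgentId 'HOST-A' since a new connection with the same AgentId has come in.",
    "Message:  Rejecting existing connection with AgentId 'HOST-B' since a new connection with the same AgentId has come in.",
]
print(count_rejections(sample).most_common())
# → [('HOST-A', 2), ('HOST-B', 1)]
```

A few agents rejecting occasionally is normal churn; a high and sustained count across many agents points to the server-side load problem described below.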

In the endpoint agent logs, set to FINEST level, you see the following sequence of events repeatedly, with new connections being initiated after the agent backoff period, instead of after the agent's polling interval:

01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'
01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTION_SUCCESS_PENDING state
01/27/2020 18:11:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTED state
01/27/2020 18:11:22 | 19096 | FINE    | Communication.CurlTransportLayer | Successfully flushed outbound data
01/27/2020 18:16:36 | 21472 | FINEST  | Communication.CommLayer.ClientDataConnection | The comm layer initiated disconnection.
01/27/2020 18:16:36 | 21472 | FINE    | Communication.CurlTransportLayer | Disconnecting from client side
01/27/2020 18:16:36 | 19096 | FINEST  | Communication.CurlTransportLayer | TransportDisconnectionInformation [DisconnectReason: DISCONNECT_METHOD_CALLED, TransportErrorCode: NO_CONNECTION_ERROR, ErrorMessage:Disconnect method was called
01/27/2020 18:16:36 | 19096 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | UNGRACEFUL_DISCONNECTION event received in CONNECTED state
01/27/2020 18:16:36 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to BACKOFF state
01/27/2020 18:17:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:17:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'

….

Notice in the above event sequence that the agent:

  • Initiates a connection 30 seconds after its last disconnect, which is its backoff period
  • Takes over two minutes to transition from CONNECTION_SUCCESS_PENDING to CONNECTED
  • Initiates a disconnect from the client side about 5 minutes after flushing its last outbound data payload, which is treated as an UNGRACEFUL_DISCONNECTION
  • Transitions back to the BACKOFF state

This connection/disconnection sequence is indicative of Endpoint Server load issues.
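The unhealthy cycle above can also be flagged programmatically when reviewing agent FINEST logs. A minimal sketch in Python (the string matches are taken from the log excerpts in this article; the function name and sample lines are illustrative):

```python
def count_ungraceful_backoff_cycles(log_lines):
    """Count UNGRACEFUL_DISCONNECTION events that are followed by a
    transition to BACKOFF state -- the load-induced cycle described above."""
    cycles = 0
    saw_ungraceful = False
    for line in log_lines:
        # Check the UNGRACEFUL marker first: the graceful marker is a
        # substring of it, so branch order matters.
        if "UNGRACEFUL_DISCONNECTION event received" in line:
            saw_ungraceful = True
        elif "Transitioning to BACKOFF state" in line and saw_ungraceful:
            cycles += 1
            saw_ungraceful = False
        elif "Transitioning to POLL_INTERVAL_WAIT state" in line:
            saw_ungraceful = False  # a healthy cycle resets the flag
    return cycles

# Condensed sample lines mirroring the excerpts in this article:
sample = [
    "... | UNGRACEFUL_DISCONNECTION event received in CONNECTED state",
    "... | Transitioning to BACKOFF state",
    "... | GRACEFUL_DISCONNECTION event received in CONNECTED state",
    "... | Transitioning to POLL_INTERVAL_WAIT state",
]
print(count_ungraceful_backoff_cycles(sample))
# → 1
```

A steadily climbing count across many agents during the affected window corroborates the load diagnosis.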

In a healthy environment, after the agent's last "Successfully flushed outbound data" event, the Endpoint Server waits for its EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int period (default 30 seconds) and then sends a close event to the agent. In that case you would see the following events logged near the end of the cycle:

04/24/2020 13:24:00 |  5736 | FINE    | Communication.CurlTransportLayer | Successfully flushed outbound data
04/24/2020 13:24:30 |  5736 | FINEST  | Communication.CurlTransportLayer | Inbound TLS Message: [Content Type: Alert, Description: close notify]
04/24/2020 13:24:30 |  5736 | FINE    | Communication.CurlTransportLayer | Remote peer closed the connection
04/24/2020 13:24:30 |  5736 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | GRACEFUL_DISCONNECTION event received in CONNECTED state
04/24/2020 13:24:30 |  3124 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to POLL_INTERVAL_WAIT state

Cause

This is typically caused by over-utilization of the Endpoint Server: more agents connected than the server can handle, a higher-than-expected volume of TTD requests, or a large number of policies or complex policy conditions.

When the Endpoint Server is overloaded, it may not send idle disconnects to agents in a timely fashion. Each agent then disconnects itself after its no-traffic timeout period (CommLayer.NO_TRAFFIC_TIMEOUT_IN_SECONDS.int, default 300 seconds). This is treated as an ungraceful disconnection, and the agent transitions to the BACKOFF state instead of the POLL_INTERVAL_WAIT state.
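The arithmetic behind this amplification can be sketched with the defaults quoted in this article (your configured values may differ):

```python
# Defaults cited in this article; actual values depend on your configuration.
IDLE_TIMEOUT = 30          # EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int
NO_TRAFFIC_TIMEOUT = 300   # CommLayer.NO_TRAFFIC_TIMEOUT_IN_SECONDS.int
BACKOFF = 30               # ServerCommunicator.INITIAL_CONNECT_BACKOFF_DURATION_SECONDS.int

# Healthy: the server closes the idle connection after IDLE_TIMEOUT seconds.
healthy_hold = IDLE_TIMEOUT

# Overloaded: the server never sends the idle close, so the agent holds the
# socket until its own no-traffic timeout fires.
overloaded_hold = NO_TRAFFIC_TIMEOUT

print(f"Idle socket held {overloaded_hold // healthy_hold}x longer when overloaded")
print(f"Agent then reconnects after only {BACKOFF} s of backoff, "
      "not its polling interval")
```

With these defaults, each idle connection occupies a server-side socket ten times longer than in the healthy case, and the agent comes back after only 30 seconds, which compounds the load.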

Resolution

To resolve this, reduce whatever is driving the bottleneck. Options include:

  • If you have more than one Endpoint Server, ensure that your agents are distributed evenly among them
  • Add more Endpoint Servers and spread the agent load across them
  • Reduce the number of TTD policies by using Described Content Matching conditions instead
  • Reduce the size of your policy execution matrix by lowering the condition and exclusion counts on existing policies
  • If the agent FINEST logs show "Transitioning to BACKOFF state", then as a temporary workaround you can raise the agent's backoff period (ServerCommunicator.INITIAL_CONNECT_BACKOFF_DURATION_SECONDS.int) so that the agent does not reconnect more often than its polling interval because of this condition. This can lessen the load on the Endpoint Server somewhat.

Additional Information

The issue discussed in this article is specific to an environment where the best practices above have already been applied. A different root cause of the "Rejecting existing connection" warning is setting the Endpoint Server's EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int value higher than the load balancer's connection idle timeout.