Endpoint Server running slowly, Aggregator logs show 'Rejecting existing connection with AgentId...'

Article ID: 189528


Products

Data Loss Prevention Endpoint Suite

Issue/Introduction

You experience slow performance when interacting with a Windows Endpoint Server that has many connected agents and/or processes a high volume of Two-Tier Detection (TTD) requests.

Symptoms include:

  • High disk utilization
  • High CPU utilization
  • Lag when interacting with the Windows Explorer shell


In the Server Aggregator log you observe a high volume of the following warnings:

Class: com.symantec.dlp.communications.aclayer.impl.ApplicationConnectionsManager
Method: logRejectionAndInitiateDisconnect
Level: WARNING
Message:  Rejecting existing connection with AgentId '<AgentName>' since a new connection with the same AgentId has come in.
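To gauge how widespread the problem is, you can tally these warnings per agent. A minimal sketch in Python (the regular expression is based on the message format shown above; the function name and sample lines are illustrative, not part of the product):

```python
import re
from collections import Counter

# Matches the Aggregator warning shown above; the quoted value is the AgentId.
REJECTION = re.compile(r"Rejecting existing connection with AgentId '([^']+)'")

def count_rejections(log_lines):
    """Tally 'Rejecting existing connection' warnings per AgentId."""
    counts = Counter()
    for line in log_lines:
        match = REJECTION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Synthetic sample lines in the format quoted above:
sample = [
    "Message:  Rejecting existing connection with AgentId 'HOST-A' since a new connection with the same AgentId has come in.",
    "Message:  Rejecting existing connection with AgentId 'HOST-A' since a new connection with the same AgentId has come in.",
    "Message:  Rejecting existing connection with AgentId 'HOST-B' since a new connection with the same AgentId has come in.",
]
print(count_rejections(sample).most_common())
# → [('HOST-A', 2), ('HOST-B', 1)]
```

A few agents rejecting occasionally is normal churn; a high and sustained count across many agents points to the server-side load problem described below.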

In the endpoint agent logs, set to FINEST level, you see the following sequence of events repeatedly, with new connections being initiated after the agent backoff period, instead of after the agent's polling interval:

01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'
01/27/2020 18:08:23 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTION_SUCCESS_PENDING state
01/27/2020 18:11:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTED state
01/27/2020 18:11:22 | 19096 | FINE    | Communication.CurlTransportLayer | Successfully flushed outbound data
01/27/2020 18:16:36 | 21472 | FINEST  | Communication.CommLayer.ClientDataConnection | The comm layer initiated disconnection.
01/27/2020 18:16:36 | 21472 | FINE    | Communication.CurlTransportLayer | Disconnecting from client side
01/27/2020 18:16:36 | 19096 | FINEST  | Communication.CurlTransportLayer | TransportDisconnectionInformation [DisconnectReason: DISCONNECT_METHOD_CALLED, TransportErrorCode: NO_CONNECTION_ERROR, ErrorMessage:Disconnect method was called
01/27/2020 18:16:36 | 19096 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | UNGRACEFUL_DISCONNECTION event received in CONNECTED state
01/27/2020 18:16:36 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to BACKOFF state
01/27/2020 18:17:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:17:07 |  7136 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'

….

Notice in the above event sequence that the agent:

  • Initiates a connection 30 seconds after its last disconnect, which is its backoff period
  • Takes over two minutes to transition from CONNECTION_SUCCESS_PENDING to CONNECTED
  • Initiates a disconnect from the client side about 5 minutes after flushing its last outbound data payload, which is treated as an UNGRACEFUL_DISCONNECTION
  • Transitions back to the BACKOFF state

This connection/disconnection sequence is indicative of Endpoint Server load issues.
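The unhealthy cycle above can also be flagged programmatically when reviewing agent FINEST logs. A minimal sketch in Python (the string matches are taken from the log excerpts in this article; the function name and sample lines are illustrative):

```python
def count_ungraceful_backoff_cycles(log_lines):
    """Count UNGRACEFUL_DISCONNECTION events that are followed by a
    transition to BACKOFF state -- the load-induced cycle described above."""
    cycles = 0
    saw_ungraceful = False
    for line in log_lines:
        # Check the UNGRACEFUL marker first: the graceful marker is a
        # substring of it, so branch order matters.
        if "UNGRACEFUL_DISCONNECTION event received" in line:
            saw_ungraceful = True
        elif "Transitioning to BACKOFF state" in line and saw_ungraceful:
            cycles += 1
            saw_ungraceful = False
        elif "Transitioning to POLL_INTERVAL_WAIT state" in line:
            saw_ungraceful = False  # a healthy cycle resets the flag
    return cycles

# Condensed sample lines mirroring the excerpts in this article:
sample = [
    "... | UNGRACEFUL_DISCONNECTION event received in CONNECTED state",
    "... | Transitioning to BACKOFF state",
    "... | GRACEFUL_DISCONNECTION event received in CONNECTED state",
    "... | Transitioning to POLL_INTERVAL_WAIT state",
]
print(count_ungraceful_backoff_cycles(sample))
# → 1
```

A steadily climbing count across many agents during the affected window corroborates the load diagnosis.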

In a healthy environment, after the agent's last "Successfully flushed outbound data" event, the Endpoint Server waits for its EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int period (default 30 seconds) and then sends a close event to the agent. In that case you would see the following events logged near the end of the cycle:

04/24/2020 13:24:00 |  5736 | FINE    | Communication.CurlTransportLayer | Successfully flushed outbound data
04/24/2020 13:24:30 |  5736 | FINEST  | Communication.CurlTransportLayer | Inbound TLS Message: [Content Type: Alert, Description: close notify]
04/24/2020 13:24:30 |  5736 | FINE    | Communication.CurlTransportLayer | Remote peer closed the connection
04/24/2020 13:24:30 |  5736 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | GRACEFUL_DISCONNECTION event received in CONNECTED state
04/24/2020 13:24:30 |  3124 | FINEST  | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to POLL_INTERVAL_WAIT state

Cause

This is typically caused by over-utilization of the Endpoint Server: more agents connected than the server can handle, a higher-than-expected volume of TTD requests, or a large number of policies or complex policy conditions.

When the Endpoint Server is overloaded, it may not send idle disconnects to agents in a timely fashion. Each agent then disconnects itself after its no-traffic timeout period (CommLayer.NO_TRAFFIC_TIMEOUT_IN_SECONDS.int, default 300 seconds). This is treated as an ungraceful disconnection, and the agent transitions to the BACKOFF state instead of the POLL_INTERVAL_WAIT state.
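The arithmetic behind this amplification can be sketched with the defaults quoted in this article (your configured values may differ):

```python
# Defaults cited in this article; actual values depend on your configuration.
IDLE_TIMEOUT = 30          # EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int
NO_TRAFFIC_TIMEOUT = 300   # CommLayer.NO_TRAFFIC_TIMEOUT_IN_SECONDS.int
BACKOFF = 30               # ServerCommunicator.INITIAL_CONNECT_BACKOFF_DURATION_SECONDS.int

# Healthy: the server closes the idle connection after IDLE_TIMEOUT seconds.
healthy_hold = IDLE_TIMEOUT

# Overloaded: the server never sends the idle close, so the agent holds the
# socket until its own no-traffic timeout fires.
overloaded_hold = NO_TRAFFIC_TIMEOUT

print(f"Idle socket held {overloaded_hold // healthy_hold}x longer when overloaded")
print(f"Agent then reconnects after only {BACKOFF} s of backoff, "
      "not its polling interval")
```

With these defaults, each idle connection occupies a server-side socket ten times longer than in the healthy case, and the agent comes back after only 30 seconds, which compounds the load.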

Resolution

To resolve this, reduce whatever is driving the bottleneck. Options include:

  • If you have more than one Endpoint Server, ensure that your agents are distributed evenly among them
  • Add more Endpoint Servers and spread the agent load across them
  • Reduce the number of TTD policies by using Described Content Matching conditions instead
  • Reduce the size of your policy execution matrix by lowering the condition and exclusion counts on existing policies
  • If the agent FINEST logs show "Transitioning to BACKOFF state", then as a temporary workaround you can raise the agent's backoff period (ServerCommunicator.INITIAL_CONNECT_BACKOFF_DURATION_SECONDS.int) so that the agent does not reconnect more often than its polling interval because of this condition. This can lessen the load on the Endpoint Server somewhat.

Additional Information

The issue discussed in this article is specific to an environment where the best practices above have already been applied. A different root cause of the "Rejecting existing connection" warning is setting the Endpoint Server's EndpointCommunications.IDLE_TIMEOUT_IN_SECONDS.int value higher than the load balancer's connection idle timeout.