You experience slow performance trying to interact with a Windows Endpoint Server with many connected agents and/or doing a high volume of Two Tier Detection requests.
In the Server Aggregator log you observe a high volume of the following warnings:
Message: Rejecting existing connection with AgentId '<AgentName>' since a new connection with the same AgentId has come in.
In the endpoint agent logs, set to FINEST level, you see the following sequence of events repeatedly, with new connections being initiated after the agent backoff period, instead of after the agent's polling interval:
01/27/2020 18:08:23 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:08:23 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'
01/27/2020 18:08:23 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTION_SUCCESS_PENDING state
01/27/2020 18:11:07 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to CONNECTED state
01/27/2020 18:11:22 | 19096 | FINE | Communication.CurlTransportLayer | Successfully flushed outbound data
01/27/2020 18:16:36 | 21472 | FINEST | Communication.CommLayer.ClientDataConnection | The comm layer initiated disconnection.
01/27/2020 18:16:36 | 21472 | FINE | Communication.CurlTransportLayer | Disconnecting from client side
01/27/2020 18:16:36 | 19096 | FINEST | Communication.CurlTransportLayer | TransportDisconnectionInformation [DisconnectReason: DISCONNECT_METHOD_CALLED, TransportErrorCode: NO_CONNECTION_ERROR, ErrorMessage:Disconnect method was called
01/27/2020 18:16:36 | 19096 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | UNGRACEFUL_DISCONNECTION event received in CONNECTED state
01/27/2020 18:16:36 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to BACKOFF state
01/27/2020 18:17:07 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | No need to wait anymore before reconnecting because time elapsed in seconds since last connect is '30'
01/27/2020 18:17:07 | 7136 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Initiating connection with server: 'endpointserver:10443'
Notice in the above event sequence that the agent:
This connection/disconnection sequence is indicative of Endpoint Server load issues.
In a healthy environment, after the last "Successfully flushed outbound data" event from the agent, the Endpoint Server would wait for its EndpointCommunications. IDLE_TIMEOUT_IN_SECONDS.int, default of 30 seconds and send a close event to the agent, where you would see the following events logged near the end of the cycle:
04/24/2020 13:24:00 | 5736 | FINE | Communication.CurlTransportLayer | Sucessfully flushed outbound data
04/24/2020 13:24:30 | 5736 | FINEST | Communication.CurlTransportLayer | Inbound TLS Message: [Content Type: Alert, Description: close notify]
04/24/2020 13:24:30 | 5736 | FINE | Communication.CurlTransportLayer | Remote peer closed the connection
04/24/2020 13:24:30 | 5736 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | GRACEFUL_DISCONNECTION event received in CONNECTED state
04/24/2020 13:24:30 | 3124 | FINEST | AgentServices.ServerCommunicatorService.ConnectionStrategy | Transitioning to POLL_INTERVAL_WAIT state
This is typically caused by over-utilization of the endpoint server, such as more agents connected than what the server can handle, or a higher-than-expected volume of TTD requests, or large number of policies or complex policy conditions.
When the Endpoint Server is overloaded, it may not send idle disconnects to agents in a timely fashion, which leads the agents to disconnect themselves after their no traffic timeout period (CommLayer.NO_TRAFFIC_TIMEOUT_IN_SECONDS.int) which has a default value of 300 seconds. This is treated as an ungraceful disconnection, and the agent transitions to a backoff state, instead of a POLL_INTERVAL_WAIT state.
Some ways to resolve this are to reduce whatever is driving the bottleneck, which could take the form of:
The issue discussed in this article is specific to an environment where the following best practices have already been observed. A different root cause of the "Rejecting existing connection" issue would be setting the Endpoint Server EndpointCommunications. IDLE_TIMEOUT_IN_SECONDS.int value higher than the Load Balancer Connection Idle Timeout.