Active Agents shown as Not Reporting in the Enforce Console

Products

Data Loss Prevention, Data Loss Prevention Endpoint Suite

Issue/Introduction

You observe that agents which are successfully communicating with an EPS (Endpoint Detection Server) and Enforce show as "Not Reporting" in the Enforce Console.

A delayed "Reporting" connection status update from an EPS to Enforce can leave the agent in "Not Reporting" status until the agent either stays disconnected again for longer than the "Not Reporting After" interval or switches to another EPS. The "Not Reporting After" setting is found in the Enforce console under System -> Settings -> General -> Agent Connection Status Configuration.

Environment

DLP 15.x

Cause

A timed-out connection between the Detection Server Controller Service (MonitorController) and the Endpoint Service (Aggregator) can prevent agent connection status attributes from being sent to Enforce. This can be seen in the MonitorController(n).log files on Enforce:

com.symantec.dlp.communications.common.activitylogging.JavaLoggerImpl log
INFO: DC - Application handshake timer timed out  for connection number 4 at 2021-08-14 11:44:42.
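On an Enforce server with many rotated log files, it can help to search all of them for the handshake message at once. The following is a minimal sketch, not a DLP tool. It assumes the files match the name pattern MonitorController*.log inside a directory you pass on the command line, because the actual log directory depends on your installation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch only: scans MonitorController(n).log files for the handshake timeout message.
// Assumption: the files match MonitorController*.log in the directory given on the
// command line; the real log location depends on the installation.
class HandshakeTimeoutScan {
    static final String MARKER = "Application handshake timer timed out";

    // Returns every matching line, prefixed with the name of the file it came from.
    static List<String> scan(Path logDir) throws IOException {
        List<String> hits = new ArrayList<>();
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(logDir, "MonitorController*.log")) {
            for (Path file : files) {
                // ISO-8859-1 decodes any byte sequence, so unusual characters never abort the scan.
                for (String line : Files.readAllLines(file, StandardCharsets.ISO_8859_1)) {
                    if (line.contains(MARKER)) {
                        hits.add(file.getFileName() + ": " + line.trim());
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HandshakeTimeoutScan <log directory>");
            return;
        }
        List<String> hits = scan(Paths.get(args[0]));
        hits.forEach(System.out::println);
        System.out.println(hits.size() + " handshake timeout line(s) found");
    }
}
```

A match only shows that a handshake timed out at that moment. Compare the timestamps with when agents started showing as "Not Reporting".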

Resolution

Fix for Handshake Timer Timed Out Issue

One cause of this issue is fixed in DLP 15.8 MP1 HF3 and later. As a temporary workaround, restarting the EPS services usually allows the necessary handshake to complete, depending on the load on Enforce at the time.

Workaround for Failed Batch Persist, Out of Memory, or other replication/persistence failures

In DLP 16.0 RU1 and later, you can move agents out of a false "Not Reporting" status more quickly by restarting the Detection Server Service on the EPS they are connected to, rather than waiting for them to switch servers. On service restart, each agent is granted one additional Connect Status resend. This is especially useful in environments with only one EPS, where agents have no other server to switch to.

Workarounds for EP Server Load Based Issues

If the above fix is in place, or you are not seeing "handshake timer timed out" in the logs, consider whether any of the following workarounds apply to your environment:

If you are using load balancing, the next time the agent switches to a different EPS, that EPS sends a new connection status attribute to Enforce, which sets the agent back to "Reporting". How often agents switch EPS depends on whether Source IP Persistence/Source Address Affinity is enabled and on its timeout. Agents may also switch EPS when they change networks, for example when connecting to or disconnecting from a VPN, because this gives them a new IP address.
You can switch affected agents to a new/additional EPS to better balance overall load:
1. In the Enforce Console, navigate to System -> Agents -> Overview
2. Select the affected agents and click the "Change Server" button in the toolbar
3. Enter an EPS hostname that is different from the one the agents are currently connected to.
If a large number of agents on the same EPS are affected, try the following:
1. Restart the Detection Server Service on the EPS
  1. At startup, any Connect Status objects that had been flagged to be enqueued to Enforce (shouldEnqueueToMonitorController == true) are enqueued and sent immediately.
  2. If this works around the issue, it indicates that the affected agents' Connect Status objects had not yet been queued for replication to Enforce.
  3. Once Connect Status objects are successfully queued for replication to Enforce, their shouldEnqueueToMonitorController field is set to false.
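The restart behavior described above can be pictured with a small in-memory model. This is a hypothetical sketch. ConnectStatus, StartupRequeue, and the queue are illustrative stand-ins, and only the field name shouldEnqueueToMonitorController comes from this article. It shows why a restart helps: objects still flagged are picked up once, and then the flag is cleared.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of the startup requeue. Class names and the in-memory queue are
// illustrative; only the field name shouldEnqueueToMonitorController is from the article.
class StartupRequeue {
    static final class ConnectStatus {
        final String agentId;
        boolean shouldEnqueueToMonitorController;

        ConnectStatus(String agentId, boolean shouldEnqueue) {
            this.agentId = agentId;
            this.shouldEnqueueToMonitorController = shouldEnqueue;
        }
    }

    // Stand-in for the replication queue that carries updates to Enforce.
    final Deque<ConnectStatus> replicationQueue = new ArrayDeque<>();

    // At service startup, enqueue every status object that is still flagged, then clear
    // the flag so the same object is not queued again on a later pass.
    void onServiceStart(List<ConnectStatus> persisted) {
        for (ConnectStatus status : persisted) {
            if (status.shouldEnqueueToMonitorController) {
                replicationQueue.add(status);
                status.shouldEnqueueToMonitorController = false;
            }
        }
    }

    public static void main(String[] args) {
        List<ConnectStatus> persisted = new ArrayList<>();
        persisted.add(new ConnectStatus("WIN10ENT", true));  // never reached the queue
        persisted.add(new ConnectStatus("WIN10PRO", false)); // already queued earlier
        StartupRequeue eps = new StartupRequeue();
        eps.onServiceStart(persisted);
        System.out.println(eps.replicationQueue.size() + " status object(s) requeued");
    }
}
```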

Additional Logging to Troubleshoot

Endpoint Detection Servers

In AggregatorLogging.properties configure the following settings and classes:

# update these
java.util.logging.FileHandler.limit = 10000000
java.util.logging.FileHandler.count = 50
java.util.logging.FileHandler.level = FINEST

# add these
com.symantec.dlp.applications.subsystems.attributes.connectionstatus.AgentConnectionStatusAttributeProviderSubsystem.level = FINEST
com.symantec.dlp.communications.monitorcontroller.applications.subsystems.MonitorControllerAgentAttributeValuesForwarderSubsystem.level = FINEST
com.symantec.dlp.communications.aclayer.impl.ApplicationConnectionsManager.level = FINE
Restart the Detection Server Service
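Before restarting, you may want to confirm that the edited file actually contains these values. The following is a minimal sketch using java.util.Properties, which reads the same key = value format. It reports any required key that is missing or set differently. The file path is supplied on the command line and is not taken from this article.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Sketch only: checks an AggregatorLogging.properties file for the settings listed above.
class LoggingConfigCheck {
    static final Map<String, String> REQUIRED = new LinkedHashMap<>();
    static {
        REQUIRED.put("java.util.logging.FileHandler.limit", "10000000");
        REQUIRED.put("java.util.logging.FileHandler.count", "50");
        REQUIRED.put("java.util.logging.FileHandler.level", "FINEST");
        REQUIRED.put("com.symantec.dlp.applications.subsystems.attributes.connectionstatus."
                + "AgentConnectionStatusAttributeProviderSubsystem.level", "FINEST");
        REQUIRED.put("com.symantec.dlp.communications.monitorcontroller.applications.subsystems."
                + "MonitorControllerAgentAttributeValuesForwarderSubsystem.level", "FINEST");
        REQUIRED.put("com.symantec.dlp.communications.aclayer.impl.ApplicationConnectionsManager.level",
                "FINE");
    }

    // Returns one message per required key that is absent or has a different value.
    static List<String> problems(Reader config) throws IOException {
        Properties props = new Properties();
        props.load(config);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : REQUIRED.entrySet()) {
            String actual = props.getProperty(e.getKey());
            if (actual == null) {
                out.add("missing: " + e.getKey());
            } else if (!actual.trim().equals(e.getValue())) {
                out.add("expected " + e.getKey() + " = " + e.getValue() + ", found " + actual.trim());
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LoggingConfigCheck <path to AggregatorLogging.properties>");
            return;
        }
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            List<String> found = problems(reader);
            found.forEach(System.out::println);
            System.out.println(found.isEmpty() ? "all settings present" : found.size() + " problem(s)");
        }
    }
}
```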

Log Samples from the Aggregator(n) logs:

Sep 1, 2021 12:45:33 PM com.symantec.dlp.communications.aclayer.impl.ApplicationConnectionsManager hasSwitchedDetectionServers
FINE: Agent 'WIN10ENT' has switched detection server from 'detection2'
Sep 1, 2021 12:45:33 PM com.symantec.dlp.applications.subsystems.attributes.connectionstatus.AgentConnectionStatusAttributeProviderSubsystem onConnect
FINER: 'WIN10ENT' doesn't exist already in the connection status cache and has been newly added.

Aug 2, 2021 6:46:27 PM com.symantec.dlp.applications.subsystems.attributes.connectionstatus.AgentConnectionStatusAttributeProviderSubsystem$AgentInactiveSearchTask run
FINEST: Found an inactive agent: 'WIN10ENT', enqueing its connection status attributes for forwarding.
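With FINEST logging enabled, the Aggregator logs grow quickly. A small filter can list only the three message types shown above, per agent. This is a sketch. The regular expressions are written against these samples only, and other DLP releases may word the messages differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: summarizes the Aggregator messages shown in the samples above.
// The patterns match the sample wording and may need adjusting for other releases.
class AggregatorEvents {
    static final Pattern SWITCHED =
            Pattern.compile("Agent '([^']+)' has switched detection server from '([^']+)'");
    static final Pattern ADDED =
            Pattern.compile("'([^']+)' doesn't exist already in the connection status cache");
    static final Pattern INACTIVE =
            Pattern.compile("Found an inactive agent: '([^']+)'");

    // Returns one summary per recognized message line, in input order.
    static List<String> summarize(List<String> lines) {
        List<String> events = new ArrayList<>();
        for (String line : lines) {
            Matcher m = SWITCHED.matcher(line);
            if (m.find()) {
                events.add(m.group(1) + ": switched detection server from " + m.group(2));
                continue;
            }
            m = ADDED.matcher(line);
            if (m.find()) {
                events.add(m.group(1) + ": newly added to connection status cache");
                continue;
            }
            m = INACTIVE.matcher(line);
            if (m.find()) {
                events.add(m.group(1) + ": inactive, status enqueued for forwarding");
            }
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java AggregatorEvents <Aggregator log file>");
            return;
        }
        summarize(Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1))
                .forEach(System.out::println);
    }
}
```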

Enforce

In MonitorControllerLogging.properties, configure the following settings and class:

# update these
java.util.logging.FileHandler.limit = 10000000
java.util.logging.FileHandler.count = 50
java.util.logging.FileHandler.level = FINEST

# add this
com.vontu.monitor.controller.replicatorcommlayer.applications.agentstatus.AgentStatusAttributeListMarshallablePersister.level = FINEST
Restart the Detection Server Controller Service (aka MonitorController)

Log Samples from the MonitorController(n) logs:

Sep 1, 2021 1:36:17 PM com.vontu.monitor.controller.replicatorcommlayer.applications.agentstatus.AgentStatusAttributeListMarshallablePersister persist
FINEST: MonitorId: 2, DataId: c7611a66-4ea3-49fd-865c-fabd8097aecc, ListMarshallable: items=[agentId=WIN10ENT, listItemMarshallables=[attributeId=2, lastActiveTimeInMillis=1630524968053, lastInActiveTimeInMillis=0, lastDisconnectedTimeInMillis=0]]
Sep 1, 2021 1:36:17 PM com.vontu.monitor.controller.replicatorcommlayer.applications.agentstatus.AgentStatusAttributeListMarshallablePersister persist
FINER: Attempting to persist 1 agent status marshallables.
Sep 1, 2021 1:36:17 PM com.vontu.monitor.controller.replicatorcommlayer.applications.agentstatus.AgentStatusAttributeListMarshallablePersister persist
FINER: Elapsed time for processing 1 agent status marshallables is : 33546927 nanos (33546927 nanoseconds per agent).
Sep 1, 2021 1:36:17 PM com.vontu.monitor.controller.replicatorcommlayer.applications.agentstatus.AgentStatusAttributeListMarshallablePersister persist
FINE: AgentStatusAttributeListMarshallablePersister JDBCTemplate executed successfully.
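The time fields in the persist line are Unix epoch milliseconds, and the elapsed time is in nanoseconds. The following sketch converts them to readable values. It assumes that a value of 0 means the field has not been set, which matches the sample but is not stated in this article.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: converts the *TimeInMillis fields of a persist line to UTC timestamps.
// Treating 0 as "not set" is an assumption based on the sample line.
class PersistLineDecoder {
    static final Pattern TIME_FIELD = Pattern.compile("(last\\w*TimeInMillis)=(\\d+)");

    static String decode(String line) {
        StringBuilder out = new StringBuilder();
        Matcher m = TIME_FIELD.matcher(line);
        while (m.find()) {
            long millis = Long.parseLong(m.group(2));
            String when = millis == 0 ? "not set" : Instant.ofEpochMilli(millis).toString();
            out.append(m.group(1)).append(" = ").append(when).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "FINEST: MonitorId: 2, DataId: c7611a66-4ea3-49fd-865c-fabd8097aecc, "
                + "ListMarshallable: items=[agentId=WIN10ENT, listItemMarshallables=[attributeId=2, "
                + "lastActiveTimeInMillis=1630524968053, lastInActiveTimeInMillis=0, "
                + "lastDisconnectedTimeInMillis=0]]";
        System.out.print(decode(sample));
        // The elapsed-time line reports nanoseconds; convert to whole milliseconds.
        System.out.println("persist took " + TimeUnit.NANOSECONDS.toMillis(33546927L) + " ms");
    }
}
```

For the sample line, lastActiveTimeInMillis decodes to 2021-09-01T19:36:08.053Z, and the two other time fields are 0. The persist of one agent took 33 ms.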

Additional Information

Connection Sequence

An agent disconnects from an EPS and does not reconnect for longer than the "Not Reporting After" interval
1. Once this interval has elapsed, the EPS that the agent was last connected to sends a disconnected status update to Enforce, which sets the agent to "Not Reporting"
Later, the agent comes back online and connects to an EPS. Because the agent is newly inserted into that EPS's connection status cache, the EPS sends a connected status update, but this update is delayed in reaching Enforce for a significant period of time
As long as the agent keeps reconnecting to this same EPS, so that it never ages out of the connection status cache, the agent remains in the "Not Reporting" state on Enforce
1. This is because the agent's DETECTION_SERVER_ID string is updated to match the current EPS hostname. When the agent reconnects, its DETECTION_SERVER_ID matches the EPS's own hostname, so the EPS Aggregator service does not send another connected status update
2. The AgentConnectionStatusAttributeProviderSubsystem sends connection status updates to Enforce only in the following situations:
  1. Connected / "Reporting"
    1. The agent has never connected to this EPS before, or has aged out of the EPS's agentConnectionStatusCache because it did not reconnect to this EPS for longer than the "Not Reporting After" interval
  2. Connected / "Reporting"
    1. The agent has connected to this EPS before and is still in its agentConnectionStatusCache, but the DETECTION_SERVER_ID string sent by the agent does not match this EPS's hostname
  3. Disconnected / "Not Reporting"
    1. The AgentInactiveSearchTask runs every 5 minutes. If it finds an inactive agent (one that has been disconnected for longer than the "Not Reporting After" interval), it sends a Connect Status update to Enforce and, if the update succeeds, removes the agent from the cache.
  4. Disconnected / "Not Reporting"
    1. On startup, the EPS reloads the agentConnectionStatusCache from disk. Any inactive agents in the cache whose status should already have been reported to Enforce are retried and, if successful, removed from the cache.
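The send rules above can be traced with a simplified model of a single EPS. This is a hypothetical sketch, not DLP's implementation. The class and method names are illustrative. An agent's last connection time stands in for its disconnect time, and every update is assumed to reach Enforce. Case 4 is treated as the same inactivity check, run once after the cache is reloaded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the EPS-side send rules. Names are illustrative.
// Simplifications: last connection time stands in for disconnect time, and every
// update sent is assumed to reach Enforce.
class ConnectionStatusModel {
    private final String hostname;              // this EPS's hostname
    private final long notReportingAfterMillis; // the "Not Reporting After" interval
    // agent ID -> time of the agent's most recent connection to this EPS
    private final Map<String, Long> agentConnectionStatusCache = new HashMap<>();
    // Updates this EPS would send to Enforce, in order.
    final List<String> sentToEnforce = new ArrayList<>();

    ConnectionStatusModel(String hostname, long notReportingAfterMillis) {
        this.hostname = hostname;
        this.notReportingAfterMillis = notReportingAfterMillis;
    }

    // Cases 1 and 2: an agent connects, reporting the DETECTION_SERVER_ID it last used.
    void onConnect(String agentId, String detectionServerId, long nowMillis) {
        boolean cached = agentConnectionStatusCache.containsKey(agentId);
        agentConnectionStatusCache.put(agentId, nowMillis);
        if (!cached || !hostname.equals(detectionServerId)) {
            sentToEnforce.add("Reporting:" + agentId);
        }
        // Otherwise nothing is sent. If the earlier "Reporting" update has not reached
        // Enforce, the agent stays "Not Reporting" there while this branch keeps running.
    }

    // Case 3: the periodic inactive-agent search (every 5 minutes on a real EPS).
    void inactiveSearch(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = agentConnectionStatusCache.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > notReportingAfterMillis) {
                sentToEnforce.add("NotReporting:" + entry.getKey());
                it.remove(); // removed because delivery is assumed to succeed
            }
        }
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        ConnectionStatusModel eps = new ConnectionStatusModel("detection1", 60 * minute);
        eps.onConnect("WIN10ENT", "detection2", 0);           // new to cache: case 1
        eps.onConnect("WIN10ENT", "detection1", 10 * minute); // ID matches: nothing sent
        eps.inactiveSearch(80 * minute);                      // idle 70 min > 60: case 3
        eps.onConnect("WIN10ENT", "detection1", 90 * minute); // aged out: case 1 again
        eps.sentToEnforce.forEach(System.out::println);
    }
}
```

In the run in main, the reconnect at 10 minutes sends nothing. If the first "Reporting" update had been lost or delayed, Enforce would keep showing "Not Reporting" until the agent aged out of the cache or switched EPS.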

Source Address Affinity in F5 BIG-IP Persistence Profile