Troubleshooting High Endpoint Event Backlog

Article ID: 285825

Products

Carbon Black EDR (formerly Cb Response)

Issue/Introduction

Troubleshooting steps for high aggregate sensor event queue backlog. 

Environment

  • Carbon Black EDR Server: All Supported Versions

Cause

Backlog is commonly caused by one or more of these:

  • Environmental Updates. 
  • Disk Full.
  • Missing writer core. 
  • Incorrect settings or configuration.
  • Poor performing queries or watchlists. 
  • Networking. 
  • Slow disk speeds. 
  • Not enough resources (CPU and/or RAM).
  • Noisy applications or endpoints. 

Resolution

Environmental Updates

  1. Application and OS updates can create temporary spikes in the backlog queue.
    • Consider pushing updates in waves so the server does not receive all of the events at the same time.
    • Adding resources or minions can help absorb these abnormal spikes.
  2. Users logging in Monday morning for the first time that week can create a temporary spike.
    • Even in sleep mode, endpoints still generate events that accumulate.
  3. Monitor that the backlog is trending downward. If the backlog stays level or continues to rise, continue with troubleshooting. A sketch for sampling the aggregate sensor-side backlog follows.
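
As a rough way to watch the trend, the aggregate sensor-side backlog can be sampled from the Sensor API. This is a minimal sketch, assuming an API token with permission to list sensors, that jq is installed, and that each sensor object includes a num_eventlog_bytes field; add -k to curl if the server uses a self-signed certificate.

  # Sum the queued event-log bytes reported by every sensor; run periodically
  # and confirm the total trends downward.
  curl -s -H "X-Auth-Token: <api token>" "https://<server>/api/v1/sensor" \
    | jq '[.[].num_eventlog_bytes | tonumber] | add'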

Disk Full

  1. Confirm partitions are not full. This includes /tmp:
    df -h
  2. How to Clear Disk Space Safely on the EDR Server
    • To locate what is consuming space, see the sketch after this list.
  3. Restart services after clearing disk space.
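
If a partition is full, the largest directories can be located with standard Linux tools. This is a generic sketch, not specific to EDR; adjust the path to the partition that df showed as full.

  # Show the twenty largest directories on the /var/cb filesystem, biggest first.
  # -x keeps du on a single filesystem.
  du -xh --max-depth=2 /var/cb 2>/dev/null | sort -rh | head -20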

Missing Writer Core

  1. Validate that there is a single writer core (excluding an eventless primary):
    find /var/cb/data/solr/cbevents/cbevents_*/core.properties -exec grep -l 'writer' {} \;
  2. Writer Core is Missing or Corrupted
  3. Restart services.
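
To make the validation easier to read, the matches can be counted; this builds directly on the find command above. Exactly one writer core is expected on an eventful node.

  # Count cores whose core.properties marks them as the writer; expect 1.
  find /var/cb/data/solr/cbevents/cbevents_*/core.properties -exec grep -l 'writer' {} \; | wc -l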

Poor Performing Queries or Watchlists

  1. Finding Poor Performing Watchlist Queries

Incorrect Settings or Configuration

  1. If the system is running on a VM, validate that the allocated resources are dedicated and not shared.
    • Sharing with other VMs can slow access to resources during high-load times when the datastore and Solr need them.
  2. The VM may report the incorrect disk type when using SSDs. A check is sketched after this list.
    1. The application uses lsblk to determine the disk type. On VMs this may come back as a spinning disk, causing Solr to start up with fewer merge threads.
    2. If the machine is verified as using SSDs, set "SolrDiskType" in /etc/cb/cb.conf to "ssd":
      SolrDiskType=ssd
    3. Restart services.
  3. If the Event Forwarder is installed:
    1. Validate that it is not installed on each node (for a cluster). The Event Forwarder should only be installed on the primary server.
      rpm -qa | grep forwarder
    2. Validate whether the Event Forwarder has high resource usage:
      top -bc -n 1 | grep -v 'grep' | grep 'event-forwarder'
    3. Options:
      1. Consider adding resources to the server, or hosting the Event Forwarder on a separate server, if resource usage is high.
      2. Consider adding another minion and setting the primary to eventless.
      3. Reduce what is collected by the Event Forwarder. Raw sensor events require more resources to keep up.
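
For the disk type check in step 2 above, lsblk can be queried directly. This is standard util-linux output: ROTA=1 means the kernel reports a rotational (spinning) disk, and ROTA=0 means an SSD.

  # If a verified SSD shows ROTA=1, apply the SolrDiskType override described above.
  lsblk -d -o NAME,ROTA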

Networking

  1. Validate that the eventful node is reachable from an assigned sensor over port 443. A sensor can show as online and report its backlog because it maintains a connection to the primary, yet still be unable to submit the data to the minion.
    1. Validate the eventful node information is correct. 
      1. Log into the EDR Console as a Global Admin. 
      2. In the top right, select your username and click "Settings". 
      3. Find the "Server Nodes" tab and click. 
      4. Validate the IP/FQDN of each "Node URL" is correct. 
    2. For FQDN-based node URLs, validate that the minion node's hostname is resolvable:
      nslookup <hostname>
    3. Validate the connection through port 443: 
      1. Windows PowerShell:
        Test-NetConnection <fqdn or ip> -Port 443
      2. Mac/Linux command line:
        nc -v -z -w2 <ip or fqdn> 443
  2. Validate with your network team that SSL inspection is not being used between the sensors and the minion node. A spot-check is sketched below.
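
One way to spot-check for SSL inspection is to examine the certificate actually presented on port 443; this is a generic OpenSSL sketch. An unexpected issuer, such as a proxy appliance's CA rather than the certificate the minion serves, suggests interception in the path.

  # Print the issuer and subject of the certificate received from the minion node.
  echo | openssl s_client -connect <minion fqdn or ip>:443 2>/dev/null | openssl x509 -noout -issuer -subject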

Slow Disk Speeds

  1. Validate the count of blocked processes waiting for disk I/O operations to complete.
    1. Run this command until it completes (60 seconds). This must be captured while backlog is high and services have not been recently restarted.
    2. Column b shows the blocked-process count; wa shows I/O wait.
      vmstat 1 60
  2. Validate read and write waits.
    1. Run this command until it completes (30 seconds). This must be captured while backlog is high and services have not been recently restarted.
    2. Review r_await, w_await, and %util. These can help determine whether reads or writes are too slow on this disk.
      iostat -d -m -x -t 5 6
  3. Generally, sustained I/O wait above 6% can create bottlenecks in ingestion. A helper for averaging the vmstat samples is sketched below; see also "Disk Speeds" under Additional Information.
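
As a rough helper for the 6% guideline, the vmstat samples above can be averaged. This sketch assumes the standard procps column layout on RHEL/CentOS, where wa is the 16th column.

  # Average the wa column over 60 one-second samples and flag sustained I/O wait above 6%.
  vmstat 1 60 | awk 'NR>2 {sum+=$16; n++} END {avg=sum/n; printf "average iowait: %.1f%%\n", avg; if (avg>6) print "WARNING: sustained I/O wait above 6%"}'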

Not Enough Resources (CPU/RAM)

  1. Validate that your server meets the Operating Environment Requirements (OER). A quick inventory sketch follows this list.
    • Environmental changes, such as added sensors or new noisy applications, can bring a server out of OER.
  2. For systems that meet the OER and have adequate disk speeds, configuration changes may be available with the help of support.
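
To compare a server against the OER, a quick inventory can be gathered with standard commands. The /var/cb path is an example; check whichever partition holds the EDR data.

  nproc            # logical CPU count
  free -g          # total and available RAM in GiB
  df -h /var/cb    # size and usage of the EDR data partition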

Noisy Applications or Endpoints

  1. Event filtering may be required. How to determine top noisy and chatty hosts and processes
  2. What Options are Available For Filtering Event Data?

 

Additional Information

  • Aggregate Sensor Event Queue vs Aggregate Sensor Binary Queue:
    • The event queue should be the main concern. It contains the searchable process and binary metadata that alerts are generated from.
    • The binary queue holds the physical .zip files of binary data downloaded from sensors. This queue does not affect events showing in the console, and its backlog will be inflated when new binaries are seen. The server collects the .zip from only one sensor; the other sensors delete the zip locally once the associated event has been sent. The .zip files can be up to 24 MB in size, creating additional inflation.
  • Backlog may not be resolved immediately. 
    • It may take days or even weeks for backlog to return to a normal state. This depends on disk speeds, resources, and the amount of backlog when a resolution is attempted.
    • The goal is to validate backlog is trending downward until normal. 
  • Backlog should never be zero. Sensors should always be collecting data to send. 
  • Disk Speeds: This section attempts to collect a live look. Keep in mind that iowait may look healthy at the time of capture if the disk is not under stress at that moment. If I/O waits look good in these results, support may ask you to run the disk qualifier utility to evaluate the disk's ability to handle the load of documents per day in your environment.
  • If backlog continues to be an issue, or assistance is needed in validating the results, open a support case and provide the information described in Collecting Logs for High Event Backlog.