Collecting Logs for High Event Backlog

Article ID: 424203


Products

Carbon Black EDR

Issue/Introduction

Log collection is required for troubleshooting high event backlog issues with Support.

Environment

  • Carbon Black EDR Server: All Versions

Resolution

Answer these questions

  1. When did the issue start?
  2. Was there a change in the environment since the backlog started?
    • Recently added sensors?
    • New Applications installed or updated?
    • New Deployment mechanism? 
  3. Is the system running on a VM? If so, are the resources dedicated rather than shared with other VMs?

  4. Was a minion node recently added? 
  5. Any changes in network settings? 
  6. Is the disk full? Check all disk partitions. 
    df -h
  7. Is the system using Spinning or SSD disks?
  8. What disk type does lsblk report? Provide the output:
     sudo lsblk -d -o name,rota
  9. What disk type does Solr believe is in use? Provide the output:
    curl -s 'http://localhost:8080/solr/admin/metrics?group=solr.node' | grep spins
  10. Is site throttling being used? Check Username > Settings > Site Throttling.
  11. Are there network bandwidth restrictions? 
  12. Is the event forwarder being used in this environment? Where is the event forwarder installed?
  13. Is the writer core present? Validate that there is exactly one, using this command (a combined check for questions 6, 8, 9, and 13 is sketched after this list):
      find /var/cb/data/solr/cbevents/cbevents_*/core.properties -exec grep -l 'writer' {} \;
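For convenience, questions 6, 8, 9, and 13 can be checked in one pass with the short script below. This is a minimal sketch built only from the commands already listed above; the Solr port (8080) and the /var/cb/data path are assumed to match a default installation.

    # Combined disk and Solr checks (questions 6, 8, 9, and 13).
    echo "== Disk usage =="; df -h
    echo "== Rotational flag (1 = spinning, 0 = SSD) =="; sudo lsblk -d -o name,rota
    echo "== Solr's view of the disk type =="; curl -s 'http://localhost:8080/solr/admin/metrics?group=solr.node' | grep spins
    echo "== Writer cores (expect exactly one line) =="; find /var/cb/data/solr/cbevents/cbevents_*/core.properties -exec grep -l 'writer' {} \;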

 

Collect the following diagnostics during high backlog times. Do not restart services before collecting the data. 

  1. Start a thread dump capture (all eventful nodes). Run each loop for 5 minutes, then press Ctrl+C to exit. (A timed variant is sketched after this list.)

    while true; do kill -s 3 $(/usr/share/cb/cbquery service-pid cb-solr); sleep 3; done 
    while true; do kill -s 3 $(/usr/share/cb/cbquery service-pid cb-datastore); sleep 3; done
     
  2. Run this command to generate the current datastore allocations:
    /usr/share/cb/virtualenv/bin/python -m cb.maintenance.cballocate.main -i datastore >> /var/log/cb/datastore_stats.txt
  3. Generate a backlog-per-node report (run on the primary):
    psql -p 5002 cb -c "COPY (SELECT sr.node_id, pg_size_pretty(SUM(ss.num_eventlog_bytes)) AS total_node_backlog, pg_size_pretty(ROUND(AVG(ss.num_eventlog_bytes))) AS average_node_backlog from sensor_registrations sr JOIN sensor_status ss ON ss.id=sr.id  WHERE ss.next_checkin_time > current_timestamp - (interval'24 hour') GROUP BY sr.node_id) TO '/var/log/cb/backlog_per_node_$(date +%Y%m%d).csv' WITH CSV HEADER";
     
  4. Generate a backlog-by-size report with sensor counts (run on the primary):
    psql -p 5002 cb -c "COPY ((select '>1.5GB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 1610612736 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '1-1.5GB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 1073741824 and num_eventlog_bytes < 1610612736 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '750-1GB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 786432000 and num_eventlog_bytes < 1073741824 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '500-750MB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 524288000 and num_eventlog_bytes < 786432000 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '100-500MB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 104857600 and num_eventlog_bytes < 524288000 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '50-100MB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 52428800 and num_eventlog_bytes < 104857600 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '5-50MB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes > 5242880 and num_eventlog_bytes < 52428800 and last_checkin_time >= (current_timestamp - interval '24 hour')) UNION ALL (select '<5MB' as label, pg_size_pretty(sum(num_eventlog_bytes)), count(*) from sensor_status where num_eventlog_bytes < 5242880 and last_checkin_time >= (current_timestamp - interval '24 hour')) ORDER BY COUNT) TO '/var/log/cb/sensor_backlog_$(date +%Y%m%d).csv' with CSV HEADER";
  5. Run this command until it completes (60 seconds):
    vmstat 1 60 >> /var/log/cb/vmstat_$(date +%Y%m%d).log
  6. Capture cbdiags (all nodes). A cbdiag.zip file will be created in the current working directory; the output from the commands in steps 1-5 will be collected as part of the diag capture:

    /usr/share/cb/cbdiag
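
If leaving the step 1 loops running interactively is impractical, the same captures can be bounded with the coreutils timeout command so each stops after 5 minutes without Ctrl+C. This is a sketch under the assumption that timeout is available on the server; the kill and cbquery invocations are unchanged from step 1.

    # Run each thread dump loop for 300 seconds (5 minutes), then stop automatically.
    timeout 300 bash -c 'while true; do kill -s 3 $(/usr/share/cb/cbquery service-pid cb-solr); sleep 3; done'
    timeout 300 bash -c 'while true; do kill -s 3 $(/usr/share/cb/cbquery service-pid cb-datastore); sleep 3; done'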

Additional Information

  • Restarting services can create a short burst in ingestion, so logs collected right after a restart may not give Support the information needed to investigate. Collect the data when services have been running for at least a few hours and the backlog is high.
  • Answering the questions above helps Support narrow the focus of the investigation.
  • Troubleshooting High Endpoint Event Backlog