How the flow reindexer works in Security Analytics

Products

Security Analytics

Issue/Introduction

The flow reindexer scans previously captured time periods for flows that should have been indexed but were not. If the DPI classification engine is too busy, or misses indexing a packet, the flow reindexer finds the flow when it scans that time period.

When it is enabled, it schedules all time periods to be scanned for flows that have not been indexed. Scanning and reindexing can be very resource-intensive, so the process does not run during high system load. It checks the CPU load and waits until the load drops to an acceptable level.
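The load check described above can be pictured with a short shell sketch. This is not the reindexer's actual implementation: the function names, the threshold, and the polling interval are hypothetical, and the sketch assumes a Linux host where the first field of /proc/loadavg is the 1-minute load average.

```shell
#!/bin/sh
# Illustrative only: wait until the 1-minute load average is below a limit.
# The limit, interval, and file arguments are hypothetical parameters,
# not settings of the flow reindexer itself.

# Succeeds when the first field of the loadavg file is below the limit.
load_below() {
    limit=$1
    file=${2:-/proc/loadavg}
    awk -v limit="$limit" '{ exit !($1 < limit) }' "$file"
}

# Polls every "interval" seconds until load_below succeeds.
wait_for_low_load() {
    limit=$1
    interval=${2:-60}
    file=${3:-/proc/loadavg}
    until load_below "$limit" "$file"; do
        sleep "$interval"
    done
}
```

For example, `wait_for_low_load 4.0 30` would return only once the 1-minute load average is below 4.0, checking every 30 seconds.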

Cause

Sensors can become overwhelmed by inbound packets combined with Actions, Alerts, Extractions, and Reports. The first priority is to capture the packets; the second is to index them. The indexing process may not find sufficient CPU or memory resources to index all packets and flows as they are captured.

A slot containing a non-indexed packet is an "unindexed slot". Unindexed slots are marked in blocks of five. Five slots is roughly 300 MB, even though the unindexed portion may be very small. Marking whole blocks keeps capture as efficient as possible and ensures that everything that needs reindexing is reindexed.
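As a rough illustration of the arithmetic, the sketch below estimates how much data is marked for a given number of five-slot blocks. It uses only the approximate figure quoted above (about 300 MB per block, so about 60 MB per slot). The function name is hypothetical and the result is an estimate, not a value reported by the appliance.

```shell
#!/bin/sh
# Rough size, in MB, of N marked five-slot blocks, using the approximate
# figure of 300 MB per block quoted in this article.
marked_block_mb() {
    echo $(( $1 * 300 ))
}

marked_block_mb 4   # four marked blocks is roughly 1200 MB
```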

The flow reindexer requires a large amount of resources to scan all old flows. Flows typically need reindexing only when the load is high, so on an appliance that needs reindexing, the flows often cannot be reprocessed because the traffic load is still too high.

The flow reindexer process can cause high memory, CPU, and disk utilization. This can overwhelm the sensor and dramatically reduce its capture capability.

Resolution

If you suspect that the reindexer is causing issues with your appliance, make sure that you are running the latest supported version and then contact support for further assistance.

Workaround

If you suspect that the reindexing process is causing adverse effects on the appliance, you can disable automatic flow reindexing. Manual jobs can still be submitted, but the host will not scan for unindexed flows on its own.

The setting is in /etc/solera/config/apps_config.json. The line to disable automatic jobs reads "disable_jobs":1. The possible values of disable_jobs are: 0 = enable all, 1 = disable automatic reindexing while allowing manual reindexing, 2 = disable manual, 3 = disable all.

The simple method to disable automatic reindexing:

reindexer_config_util -d 1
service solera-reindexerd restart
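To confirm which value is currently set, a small helper can read the disable_jobs key from the configuration file and print its meaning. This is a minimal sketch: the function name is hypothetical, and it assumes the key appears once in the file, written as "disable_jobs":N with N between 0 and 3. The surrounding JSON structure is not inspected.

```shell
#!/bin/sh
# Print the disable_jobs value found in a config file and what it means.
# Assumes the key appears once, written as "disable_jobs":N with N in 0-3.
describe_disable_jobs() {
    file=${1:-/etc/solera/config/apps_config.json}
    value=$(sed -n 's/.*"disable_jobs"[[:space:]]*:[[:space:]]*\([0-9]\).*/\1/p' "$file" | head -n 1)
    case "$value" in
        0) echo "0: automatic and manual reindexing enabled" ;;
        1) echo "1: automatic reindexing disabled, manual allowed" ;;
        2) echo "2: manual reindexing disabled" ;;
        3) echo "3: all reindexing disabled" ;;
        *) echo "disable_jobs not found or unrecognized" >&2; return 1 ;;
    esac
}
```

Running `describe_disable_jobs` with no argument reads the default path named in this article.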

To remove all current reindexing jobs:

1) Display all jobs in the queue, which is the table called "retrospective_jobs"

echo 'select * from retrospective_jobs' | su - postgres -c 'psql dsweb'

2) Note the "id" of any job whose status is neither 100 (100% complete) nor 0 (not started yet). Such a job is currently running. There should only be one of these.

3) Stop monit and the flow reindexer service:

service monit stop
service solera-reindexerd stop

4) Remove all existing jobs in the reindexer queue

echo 'truncate table retrospective_jobs' | su - postgres -c 'psql dsweb'

5) Verify the queue is empty

echo 'select * from retrospective_jobs' | su - postgres -c 'psql dsweb'

6) Terminate any jobs with a status of neither 100 nor 0. The job IDs should have been recorded in step 2 above.

flow_reindexerd -j "ID_NUM" -t (For example, flow_reindexerd -j 247 -t)

7) Restart the indexing service

service solera-shaft restart

8) Start the flow reindexer to allow manual jobs

service solera-reindexerd start; sleep 30; service monit start
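For step 2, the running job can be picked out of the query output with a small filter instead of by eye. The sketch below is a hypothetical helper, not a vendor tool. It assumes the default psql table layout (a header row, a separator line, pipe-delimited rows, and a row-count footer) and locates the "id" and "status" columns by header name, so it does not depend on column order.

```shell
#!/bin/sh
# Print the id of every job whose status is neither 0 nor 100.
# Reads default psql table output on stdin.
running_jobs() {
    awk -F'|' '
        # Header row: remember which fields hold id and status.
        !have_header {
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/[[:space:]]/, "", name)
                if (name == "id") id_col = i
                if (name == "status") status_col = i
            }
            if (id_col && status_col) have_header = 1
            next
        }
        # Skip the separator line, blank lines, and the "(N rows)" footer.
        NF < 2 || /^[-+]+$/ { next }
        {
            id = $id_col; status = $status_col
            gsub(/[[:space:]]/, "", id)
            gsub(/[[:space:]]/, "", status)
            if (status != "" && status != "0" && status != "100") print id
        }
    '
}
```

It would be used by appending `| running_jobs` to the select command from step 1.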

You can verify that no new jobs are being created by running: echo 'select * from retrospective_jobs' | su - postgres -c 'psql dsweb'

There should be no jobs with a source of 1 (automatic). The only jobs listed should be those submitted through reprocessing requests, which have a source of 2.

Determining If Reindexing Is Impacting Performance

The /var/log/messages file should not contain large numbers of "generation skipped" messages (one a day or one a week is normal).

As root run: grep generation /var/log/messages

Sample log messages:

Aug 15 13:32:54 username /usr/sbin/shaft[7332]: WARNING: Kernel surpassed indexing process by 5 slots. (generation now 83244, was 83243, index 4)
Aug 15 15:29:12 username /usr/sbin/solera-metad-flat[7302]: WARNING: Kernel surpassed indexing process by 5 slots. (generation now 97982, was 97981, index 3)
Aug 20 00:59:16 username /usr/sbin/shaft[7332]: WARNING: Kernel surpassed indexing process by 5 slots. (generation now 252703, was 252701, index 0)
Aug 20 13:29:01 username /usr/sbin/shaft[7332]: monitor_dummy_timeouts Interface 7 is stale for queue processor 0
Aug 20 13:29:01 username /usr/sbin/shaft[7332]: monitor_dummy_timeouts Interface 7 is stale for queue processor 1
Aug 20 13:29:01 username /usr/sbin/shaft[7332]: monitor_dummy_timeouts Interface 7 is stale for queue processor 2

As root run: grep LRU /var/log/messages

Sample log messages:

Jan 2 08:28:02 hostname kernel: : Last message 'solera: net interfa' repeated 9 times, supressed by syslog-ng on hostname.example.com
Jan 2 08:28:02 hostname kernel: : solera: GetAvailableLRURead fails: reserve 33 avail 32
Jan 2 08:28:04 hostname kernel: : Last message 'solera: GetAvailable' repeated 4 times, supressed by syslog-ng on hostname.example.com
Jan 2 08:28:04 hostname kernel: : solera: GetAvailableLRURead fails: reserve 33 avail 31
Jan 2 08:28:05 hostname kernel: : solera: GetAvailableLRURead fails: reserve 33 avail 32
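The LRU messages can be summarized in the same spirit. This sketch, again with a hypothetical function name, counts the "GetAvailableLRURead fails" lines and reports the lowest "avail" value seen. It assumes the message layout shown in the samples above, with the number following the word "avail".

```shell
#!/bin/sh
# Count GetAvailableLRURead failures and report the lowest "avail" value.
# Reads syslog lines on stdin.
lru_failures() {
    awk '/GetAvailableLRURead fails/ {
             n++
             for (i = 1; i < NF; i++)
                 if ($i == "avail" && (min == "" || $(i + 1) + 0 < min))
                     min = $(i + 1) + 0
         }
         END {
             if (n) print n " failures, lowest avail " min
             else   print "0 failures"
         }'
}
```

As root, `lru_failures < /var/log/messages` prints a one-line summary.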

Checking current Re-Indexing Queue

------+--------+-------------+--------------+---------+--------+--------------+--------------+-----------
265 | 1 | 1407944991 | 1407948592 | 1 | 100 | 1408008466 | 1408008466 | 0
264 | 1 | 1407941387 | 1407944991 | 1 | 100 | 1408008404 | 1408008404 | 0
276 | 1 | 1408010646 | 1408014327 | 1 | 0 | | | 0
270 | 1 | 1407962998 | 1407966598 | 1 | 100 | 1408008776 | 1408008776 | 0
279 | 1 | 1408029006 | 1408035589 | 1 | 0 | | | 0
271 | 1 | 1407966598 | 1407970199 | 1 | 100 | 1408008838 | 1408008838 | 0
273 | 1 | 1407988827 | 1408003401 | 1 | 0 | 1408043185 | | 1783649
277 | 1 | 1408014327 | 1408017939 | 1 | 0 | | | 0
282 | 1 | 1408049615 | 1408057422 | 1 | 0 | | | 0
280 | 1 | 1408035589 | 1408045759 | 1 | 0 | | | 0
(61 rows)

id = job id number
source = 1 (automatic), 2 (manual)
stime = start time of data in the job (Unix epoch seconds in the sample above)
etime = end time of data in the job (Unix epoch seconds in the sample above)
status = how far along the job is in the indexing process where 0 indicates it has not started and 100 indicates it is 100% done.
job_start = when the job started
job_end = when the job completed
slot_done = how many slots were re-indexed where 0 indicates no re-indexing was needed or it has not run.
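
The time values in the sample output are ten-digit numbers that read as Unix epoch seconds, so they can be converted to readable timestamps for comparison with log entries. The helper below is a minimal sketch with a hypothetical function name. It assumes GNU date, which accepts "@SECONDS" with the -d option.

```shell
#!/bin/sh
# Convert a Unix epoch value (seconds) to a readable UTC timestamp.
# Requires GNU date, which accepts "@SECONDS" with -d.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

epoch_to_utc 1407944991   # stime of job 265 in the sample output
```

For job 265 this prints 2014-08-13 15:49:51. Its etime, 1407948592, is 3601 seconds later, so that job covers about one hour of data.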