NSX Network Detection and Response - Troubleshooting Data Node storage issues

Article ID: 323930

Products

VMware

Issue/Introduction

The purpose of this article is to provide steps to troubleshoot Data Node disk space issues and to describe the different symptoms.

Symptoms:

When a Data Node has disk space issues, any of the following symptoms may be observed:

When a Data Node system exceeds the disk usage threshold (80%), Elasticsearch will mark indices as read-only, causing the queues in the Manager to pile up.

On the Manager appliance:

Running the utility lastline_test_appliance will show a failure like the following:

 FAILURE: Number of messages in ids_dhcp: 19970 exceeds threshold: 10000
 Number of messages in ids_krb: 17175 exceeds threshold: 10000
 Number of messages in ids_smb: 19149 exceeds threshold: 10000
 Number of messages in ids_tls: 18322 exceeds threshold: 10000
 Number of messages in ids_urls: 11903 exceeds threshold: 10000
 Number of messages in netflows: 18409 exceeds threshold: 10000
 Max total number of messages: 104928 exceeds threshold: 100000

The Monitoring logs will show related errors.


Running the command rabbitmqctl -p llq_v1 list_queues | grep -v -w 0$ will show output similar to the following:

 Listing queues
 db.analysis_completed.3104	11
 db.events.3104	2
 ids_smb	20000
 netflows	19999
 db.breach_correlation_rule_runner.3104	7
 detection_entity_network_event	19999
 ids_tls	20000
 ids_urls	20000
 ids_krb	20000
 ids_dhcp	20000


The Kibana NTA Record Counts in the Dashboard - Home never update, and an error is displayed when clicking on individual NTA Record tiles.


On the Data Node appliance:

The monitoring logs will show a related warning (Component/Type: Disk Usage).

In the file /var/log/lastline/appliance_check.log or /var/log/elasticsearch/lldns/curator.log we can confirm the read-only state of the indices:

Feb 28 05:58:54 lastline-datanode llanta-storage_llanta-storage-webrequest-dkr_1[9462]: AuthorizationException: AuthorizationException(403, u'cluster_block_exception', u'blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];')
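
Alternatively, the read-only block can be confirmed directly against Elasticsearch on the Data Node. A quick check (a sketch; the setting name is the same one cleared in step 3 of the Resolution below):

curl -s localhost:9200/_all/_settings?pretty | grep read_only_allow_delete

Lines showing "read_only_allow_delete" with a value of "true" indicate that the corresponding indices have been switched to read-only.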


Sometimes the indices take up too much disk space. We can query all the indices and their sizes to confirm this by running the command:
curl -s localhost:9200/_cat/indices?v


We can sort the output alphabetically to group the index types together and get a better idea of what is using the most storage by piping the output to sort.
e.g.:
curl -s localhost:9200/_cat/indices?v | sort

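If the Elasticsearch version on the Data Node supports the _cat sort parameter, the indices can instead be sorted by size on the server side so the largest ones are listed first (a sketch, assuming the s parameter is available):

curl -s 'localhost:9200/_cat/indices?v&s=store.size:desc'

This makes it easy to spot which index types (netflow, pdns, tls, smb, etc.) are consuming the most storage.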
 


Cause

  • A Manager appliance was configured to take full and incremental backups and store them on the Data Nodes, filling up the disk. As mentioned in our guide NSX Network Detection and Response - How to backup appliances (900094), none of the appliances (Manager, Data Node, Engine and Sensor) should be used to store backups.
  • The Data Retention for Elasticsearch indices is set to 32 days by default. If the network traffic volume is extremely high, the size of the indices can cause the disk usage to rise above the 80% threshold, as estimated in the sketch below.
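
A rough way to gauge whether retention length is the driver is to total the index footprint and divide by the retention period (a sketch, assuming the bytes and h query parameters are available on the installed Elasticsearch version):

curl -s 'localhost:9200/_cat/indices?bytes=gb&h=store.size' | awk '{t+=$1} END {print t " GB across all indices"}'

Dividing the total by the current 32-day retention approximates the daily growth; multiplying that daily figure by a shorter retention value (see step 2 of the Resolution) shows roughly how much disk the indices would occupy after the change.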

Resolution

1. Confirm whether the disk is currently full or getting full (over 80%).
This can be verified in the Monitoring Log (Component/Type: Disk Usage) or with one of the following system utilities:

lastline-df -d -h --total
or
ionice du -xah --time --max-depth=3 /var | sort | egrep 'GIT'

e.g.:
lastline-df -d -h
 Filesystem                               Size  Used Avail Use% Mounted on
 /dev/mapper/lastline--datanode--vg-root 1006G  846G  109G  89% /


If adding more disks or Data Nodes is not possible (see the Additional Information section below), then:

2. Free up disk space by configuring a shorter data retention policy than the default (32 days) by adding an override to the appliance:

In the file /etc/appliance_config/override.yaml (create it if it is not present), add the following line:
pdns::data_retention::delete_delay: 'N'
Replace 'N' with the number of days to keep data, smaller than the default of 32; for example, 21 would set data retention to 21 days.
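For example, to set a 21-day retention, the line can be appended from the CLI (a minimal sketch; adjust the value to suit the environment):

echo "pdns::data_retention::delete_delay: '21'" | sudo tee -a /etc/appliance_config/override.yaml

If the file already contains a pdns::data_retention::delete_delay entry, edit that line in place instead of appending a duplicate.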

Re-trigger the configuration with the command lastline_apply_config -d on the command line, or in the Portal UI under Admin > Appliances > Quick links > Retrigger configuration.

Note: The curator service is in charge of enforcing the data retention settings and deleting files. It runs once per day around 2 am UTC, so we need to allow some time for it to run and finish deleting files.

If returning to normal operation is critical, we can run the curator service manually to clear the files immediately. To do so, run the command:

curator --config /etc/elasticsearch-curator/curator.yml /etc/elasticsearch-curator/actions.yml
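
Depending on the Curator version installed, the same invocation can be previewed first with the --dry-run option, which only logs the indices that would be deleted without removing anything:

curator --dry-run --config /etc/elasticsearch-curator/curator.yml /etc/elasticsearch-curator/actions.yml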

To confirm the files are being or were deleted, we can look at the log file /var/log/elasticsearch/lldns/curator.log and search for delete actions.
e.g.:
grep -A 30 -i -e 'delete' /var/log/elasticsearch/lldns/curator.log

We should get output similar to the following:
2024-02-29 23:42:37,613 INFO      Preparing Action ID: 1, "delete_indices"
2024-02-29 23:42:37,615 INFO      Trying Action ID: 1, "delete_indices": delete_indices indices matching ^dhcp-|^krb-|^netflow-|^pdns-|^rdp-|^smb-|^tls-|^webrequest- and older than 18 days (based on index name)
2024-02-29 23:42:37,772 INFO      Deleting 24 selected indices: [u'netflow-20240209', u'rdp-20240209', u'tls-20240209', u'pdns-20240209', u'netflow-20240211', u'netflow-20240210', u'tls-20240211', u'tls-20240210', u'krb-20240209', u'krb-20240211', u'krb-20240210', u'dhcp-20240209', u'dhcp-20240210', u'dhcp-20240211', u'smb-20240210', u'smb-20240211', u'smb-20240209', u'pdns-20240210', u'pdns-20240211', u'webrequest-20240209', u'rdp-20240210', u'rdp-20240211', u'webrequest-20240210', u'webrequest-20240211']
2024-02-29 23:42:37,772 INFO      ---deleting index netflow-20240209
2024-02-29 23:42:37,772 INFO      ---deleting index rdp-20240209
2024-02-29 23:42:37,772 INFO      ---deleting index tls-20240209
2024-02-29 23:42:37,772 INFO      ---deleting index pdns-20240209
2024-02-29 23:42:37,773 INFO      ---deleting index krb-20240209
2024-02-29 23:42:37,773 INFO      ---deleting index dhcp-20240209
2024-02-29 23:42:37,773 INFO      ---deleting index smb-20240209
2024-02-29 23:42:37,773 INFO      ---deleting index webrequest-20240209
2024-02-29 23:42:38,448 INFO      Action ID: 1, "delete_indices" completed.

Then confirm the disk usage:
lastline-df -d -h
 Filesystem                               Size  Used Avail Use% Mounted on
 /dev/mapper/lastline--datanode--vg-root 1006G  724G  231G  76% /

After confirming the disk usage is back below the 80% threshold, we can instruct the Elasticsearch service to re-enable writing on all indices. To do so:

3. From an SSH session at the CLI prompt on the Data Node, copy and paste this command (from beginning to end including single and double quotes) in one single line: 

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

If the command completes successfully, you will see the output {"acknowledged":true} appended to the end of the line.
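
To verify that no index still carries the block, the settings check from the Symptoms section can be repeated; once the block has been cleared, the setting no longer appears in the index settings at all:

curl -s localhost:9200/_all/_settings?pretty | grep -c read_only_allow_delete

A count of 0 confirms that all indices accept writes again.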


4. Restart the Kibana UI on the Manager from an SSH session at the CLI prompt:
sudo service kibana restart
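
As a final check, the Manager-side queues should start draining once the Data Node accepts writes again. Re-running the queue listing from the Symptoms section on the Manager should show the message counts decreasing over time:

rabbitmqctl -p llq_v1 list_queues | grep -v -w 0$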

If further assistance is needed, feel free to create a support request using our Customer Connect Portal:
How to file a Support Request in Customer Connect and via Cloud Services Portal (2006985)


 



Additional Information

For further details about Data Node sizing see: ADD LINK TO THE NEW ARTICLE ONCE IT IS PUBLISHED



Note: This article is applicable to the standalone NSX Network Detection and Response product (formerly Lastline) and is not intended to be applied to the NSX NDR feature of NSX-T.

Impact/Risks:
  • The Kibana UI will not display statistics from the Elasticsearch database.
  • The Elasticsearch database on a Data Node with disk usage greater than 80% will not accept write requests for new netflow metadata because it has been automatically switched into read-only mode.
  • The netflow related queues in the Manager will keep growing because their messages cannot be processed.