NSX Network Detection and Response - What to do when Manager/Analyst/Pinbox disk is full



Article ID: 323943


Updated On:

Products

VMware NSX

Issue/Introduction

Overview

Start by understanding which partition is full. You can do this by running the command lastline-df -d -h 
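
For example, the command can be run directly from the appliance shell (the prompt below is illustrative and the output layout may vary between versions); in the output, note which mount point, such as /data or /var, is at or near 100% use:

root@lastline:/home/monitoring# lastline-df -d -h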

 

Resolution

The two main partitions Technical Support has seen fill up are /data and /var. Each partition has a different process for reclaiming space. Depending on which partition is reporting as full, use the guidelines below.

/data full


The first step in investigating a full /data partition is to run the command analyst_scheduler_data_usage.py

When this command finishes, you will get an output similar to the one below:

Executing analyst_scheduler_data_usage.py as current user
-> Running Task Analyze data usage... (uuid 68efgh9b-####-####-####-########bef)
-> Connected to analyst db on host 127.0.0.1(3306) strict=on, compress=off
-> Processing data for 2010-01-01 00:00:00 - 2020-04-19 21:01:57 [1 day, 0:00:00]
-> Running Task Analyze data usage DONE (uuid 68efgh9b-####-####-####-########bef)
-> Logging Timers
-> Timer global:
-> global: 45.77
Total: 887.43 GB in 2,803,904 files
Per type:
* upload: 477.08 GB (53.76%) in 1,068,241 files (38.10%)
* result.traffic_capture: 232.71 GB (26.22%) in 321,220 files (11.46%)
* result.report: 70.80 GB (7.98%) in 622,282 files (22.19%)
* result.llurl_framework_trace: 65.60 GB (7.39%) in 313,993 files (11.20%)
* result.screenshot: 38.84 GB (4.38%) in 416,285 files (14.85%)
* result.process_snapshot: 1.47 GB (0.17%) in 1,001 files (0.04%)
* result.codehash_yara_strings: 0.55 GB (0.06%) in 4,601 files (0.16%)
* result.codehash: 0.20 GB (0.02%) in 4,650 files (0.17%)
* result.generated_file: 0.12 GB (0.01%) in 335 files (0.01%)
* result.jaccine_ast: 0.03 GB (0.00%) in 48,630 files (1.73%)
* result.extracted_file: 0.01 GB (0.00%) in 55 files (0.00%)
* storage.webpage: 0.01 GB (0.00%) in 910 files (0.03%)
* result.executed_script_content: 0.00 GB (0.00%) in 260 files (0.01%)
* result.llbist_analyzer_trace: 0.00 GB (0.00%) in 1,441 files (0.05%)
* result.extracted_file: 0.01 GB (0.00%) in 152 files (0.01%)

From here we can get an idea of which analysis artifacts are taking up space in /data and determine which retention settings need to be adjusted. Most of the values above are linked to retention settings in lastline_setup, a Manager CLI configuration tool that provides access to additional settings, as described in the appliance manual.

The default retention settings are the following:

data_retention_code = 60 days
data_retention_generated_files = 21 days
data_retention_memory_dumps = 7 days
data_retention_process_dumps = 21 days
data_retention_screenshots = unlimited
data_retention_traffic_captures = unlimited
data_retention_uploads = unlimited
data_retention_webpages = 21 days

Here are the mappings between the common artifact types reported by the initial command and the retention settings that control them:

upload = data_retention_uploads
     Note: To change this value, first adjust analysis_queue_backlog so that it is lower than data_retention_uploads.
result.traffic_capture = data_retention_traffic_captures
result.screenshot = data_retention_screenshots
result.report = This setting is not in the CLI; it is configured in the UI as Analysis results (days) under Admin > Appliances > Manager > Quick Links > Configuration > Data Retention, as seen in the screenshot below:



After adjusting the retention settings, the service should start automatically within the next hour.

For additional details on how to use the configuration (lastline_setup) tool see:
https://user.lastline.com/lastline-pdf-opsguide-manuals/Administration_Operations_Guide.html#setupapp
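
The following is a minimal sketch only: the setting names come from the defaults listed above, but the prompt, the <setting> <value> syntax, the save step, and the example values (60 and 90 days) are assumptions that may differ on your appliance version, so confirm the exact workflow against the guide above before applying changes. Per the note earlier, analysis_queue_backlog is lowered before data_retention_uploads:

root@lastline:/home/monitoring# lastline_setup
lastline_setup > analysis_queue_backlog 60
lastline_setup > data_retention_uploads 90
lastline_setup > save
lastline_setup > exit

Here 90 days is only an example; choose values that match your organization's retention requirements and available disk space.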

To check the service status and the last time it was executed, run service analyst-scheduler-data-retention-all status. Because the job runs periodically, it will normally show as inactive (dead) between runs:

root@lastline:/home/monitoring# service analyst-scheduler-data-retention-all status
* analyst-scheduler-data-retention-all.service
   Loaded: loaded (/etc/systemd/system/analyst-scheduler-data-retention-all.service; static; vendor preset: enabled)

   Active: inactive (dead) since Tue 2022-02-15 17:06:17 UTC; 11min ago

  Process: 29557 ExecStopPost=/usr/sbin/service-lastline --provider=docker-compose analyst-scheduler-data-retention-all stop --no-check-status (code=exited, status=0/SUCCESS)
  Process: 29139 ExecStart=/usr/sbin/service-lastline --provider=docker-compose analyst-scheduler-data-retention-all start --no-detach -- (code=exited, status=0/SUCCESS)
Main PID: 29139 (code=exited, status=0/SUCCESS)
Feb 15 17:06:00 lastline systemd[1]: Starting analyst-scheduler-data-retention-all.service...

Feb 15 17:06:17 lastline systemd[1]: Started analyst-scheduler-data-retention-all.service.
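
If more history is needed than the status output provides, the systemd journal for the same unit can be queried with standard tooling (assuming journald on the appliance retains entries for this unit):

root@lastline:/home/monitoring# journalctl -u analyst-scheduler-data-retention-all.service --since "6 hours ago"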


Tracking the progress of data being deleted:

To track progress and confirm the data retention service is cleaning up files, we can look at the file /var/log/analyst-scheduler/analyst-scheduler-data-retention-all.log. The file shows when a run starts:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Running Task Enforce Analyst Data Retention Policies... (uuid 5ca57491-####-####-####-########a0b)

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Connected to analyst db on host docker-host(3306) strict=on, compress=off

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Using resume data from /data_retention_resume_ts

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Obtaining resume-date from /data_retention_resume_ts

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Initializing syslog file-change publisher

The new time range for storage, based on the retention settings configured:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Enforcing data retention policies for Analyst data in 2022-02-15 16:50:12 - 2022-02-15 17:06:07

And the files to be deleted along with the status of every task:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 19 screenshot result files to be removed for range 2021-11-17 16:50:12 - 2021-11-17 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 19 files (0.67 MB) for screenshot in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 2 upload files to be removed for range 2021-11-16 16:50:12 - 2021-11-16 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 2 files (0.59 MB) for upload in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 2 llurl_framework_trace result files to be removed for range 2022-01-15 16:50:12 - 2022-01-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 2 files (0.00 MB) for llurl_framework_trace in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 13 files (1.26 MB) for analysis_results in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 4 traffic_capture result files to be removed for range 2021-12-17 16:50:12 - 2021-12-17 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 4 files (1.12 MB) for traffic_capture in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Periodically, we see the service updating the data retention resume timestamp:

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Updating resume-file /data_retention_resume_ts to 2022-02-15 16:55:12

When the job finishes, we see this output:

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Running Task Enforce Analyst Data Retention Policies DONE (uuid 5ca57491-####-####-####-########a0b)

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Logging Timers

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Timer global:

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - global: 7.97
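
To follow a run as it happens, or to summarize what has been removed so far, standard tools against the same log file are sufficient; the grep pattern below simply matches the "Deleted ... files" lines shown above:

root@lastline:/home/monitoring# tail -f /var/log/analyst-scheduler/analyst-scheduler-data-retention-all.log
root@lastline:/home/monitoring# grep "Deleted" /var/log/analyst-scheduler/analyst-scheduler-data-retention-all.log | tail -n 20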

After that, we can run analyst_scheduler_data_usage.py --log-dir /var/log again to check the storage usage by artifact type.

Notes:

  • The additional retention settings in the UI apply only to what the UI is able to display.
  • Do not run du while troubleshooting disk space issues on /data. There are a lot of files in this partition and the Manager is very I/O sensitive.
  • If the above does not resolve your issue, see Creating and managing Broadcom support cases for assistance, and include your appliance details (license, appliance UUID).

 

/var full

The /var partition does not contain anywhere near the same volume of files as /data, so running du is not as impactful, but we should still be mindful of the I/O load it can cause. However, there is no single set process for cleaning up the files stored here. Run the following commands, then see Creating and managing Broadcom support cases and include the output along with your appliance details (license, appliance UUID).
 

sudo ionice -c 3 du -xah --time --max-depth=4 /var/ | sort | grep G    # scan /var at idle I/O priority, keeping lines containing "G" (typically entries in the gigabyte range)

sudo lsof +L1    # list files that have been deleted but are still held open by a running process