NSX Network Detection and Response - What to do when Manager/Analyst/Pinbox disk is full



Article ID: 323943


Updated On:

Products

VMware

Issue/Introduction

Overview

The first step is to understand which partition is full. You can do this by running the command lastline-df -d -h

[Screenshot: example lastline-df -d -h output showing partition usage]

The two main partitions Technical Support has seen fill up are /data and /var. Each of these partitions has a different process for reclaiming space. Depending on which partition is reporting full, use the guidelines below.
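If you want a quick cross-check with standard tools, looking at the two mount points directly is enough (a generic sketch using standard coreutils; lastline-df remains the appliance's own view of disk usage):

# Generic cross-check, not appliance-specific: show usage for the partitions discussed below
df -h /data /var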

/data full


The first step when looking into /data being full is to run the command analyst_scheduler_data_usage.py

When this command finishes, you will get output similar to the example below:

Executing analyst_scheduler_data_usage.py as current user
-> Running Task Analyze data usage... (uuid 68efgh9b-86ef-4e92-a1b0-a8f91dc8dbef)
-> Connected to analyst db on host 127.0.0.1(3306) strict=on, compress=off
-> Processing data for 2010-01-01 00:00:00 - 2020-04-19 21:01:57 [1 day, 0:00:00]
-> Running Task Analyze data usage DONE (uuid 68efgh9b-86ef-4e92-a1b0-a8f91dc8dbef)
-> Logging Timers
-> Timer global:
-> global: 45.77
Total: 887.43 GB in 2,803,904 files
Per type:
* upload: 477.08 GB (53.76%) in 1,068,241 files (38.10%)
* result.traffic_capture: 232.71 GB (26.22%) in 321,220 files (11.46%)
* result.report: 70.80 GB (7.98%) in 622,282 files (22.19%)
* result.llurl_framework_trace: 65.60 GB (7.39%) in 313,993 files (11.20%)
* result.screenshot: 38.84 GB (4.38%) in 416,285 files (14.85%)
* result.process_snapshot: 1.47 GB (0.17%) in 1,001 files (0.04%)
* result.codehash_yara_strings: 0.55 GB (0.06%) in 4,601 files (0.16%)
* result.codehash: 0.20 GB (0.02%) in 4,650 files (0.17%)
* result.generated_file: 0.12 GB (0.01%) in 335 files (0.01%)
* result.jaccine_ast: 0.03 GB (0.00%) in 48,630 files (1.73%)
* result.extracted_file: 0.01 GB (0.00%) in 55 files (0.00%)
* storage.webpage: 0.01 GB (0.00%) in 910 files (0.03%)
* result.executed_script_content: 0.00 GB (0.00%) in 260 files (0.01%)
* result.llbist_analyzer_trace: 0.00 GB (0.00%) in 1,441 files (0.05%)
* result.extracted_file: 0.01 GB (0.00%) in 152 files (0.01%)

From here we can get an idea of which analysis artifacts are taking up space in /data and determine which retention settings need to be adjusted. Most of the values above are linked to retention settings in lastline_setup, a Manager CLI tool that provides access to additional configuration options, as described in the appliance manual.

The default retention settings are the following:

data_retention_code = 60 days
data_retention_generated_files = 21 days
data_retention_memory_dumps = 7 days
data_retention_process_dumps = 21 days
data_retention_screenshots = unlimited
data_retention_traffic_captures = unlimited
data_retention_uploads = unlimited
data_retention_webpages = 21 days

Here are the mappings of the common retention settings to adjust for various results from the initial command:

upload = data_retention_uploads
     Note: Before changing this value, analysis_queue_backlog must first be adjusted to be lower than data_retention_uploads.
result.traffic_capture = data_retention_traffic_captures
result.screenshot = data_retention_screenshots
result.report = This setting is not available in the CLI; it is configured in the UI as Analysis results (days) by going to Admin > Appliances > Manager > Quick Links > Configuration > Data Retention, as seen in the screenshot below:

[Screenshot: Data Retention settings under Admin > Appliances > Manager > Quick Links > Configuration]

After adjusting the retention settings, the data retention service should start automatically within the next hour.

For additional details on how to use the configuration (lastline_setup) tool see:
https://user.lastline.com/lastline-pdf-opsguide-manuals/Administration_Operations_Guide.html#setupapp
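As an illustration only, and assuming the interactive show/set/save workflow described in that guide (verify the exact prompts and setting names there before making changes), adjusting retention values from the Manager CLI might look like this:

lastline_setup
  show                                   # list the current settings and their values
  data_retention_traffic_captures = 30   # example: keep traffic captures for 30 days
  data_retention_uploads = 60            # analysis_queue_backlog must be lower than this value
  save                                   # apply the new settings
  exit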

To check the service status and the last time it was executed, run service analyst-scheduler-data-retention-all status. Because the retention job runs on a schedule and exits when it completes, a status of inactive (dead) with processes that exited with status=0/SUCCESS is expected between runs.

root@lastline:/home/monitoring# service analyst-scheduler-data-retention-all status
* analyst-scheduler-data-retention-all.service
   Loaded: loaded (/etc/systemd/system/analyst-scheduler-data-retention-all.service; static; vendor preset: enabled)
   Active: inactive (dead) since Tue 2022-02-15 17:06:17 UTC; 11min ago
  Process: 29557 ExecStopPost=/usr/sbin/service-lastline --provider=docker-compose analyst-scheduler-data-retention-all stop --no-check-status (code=exited, status=0/SUCCESS)
  Process: 29139 ExecStart=/usr/sbin/service-lastline --provider=docker-compose analyst-scheduler-data-retention-all start --no-detach -- (code=exited, status=0/SUCCESS)
 Main PID: 29139 (code=exited, status=0/SUCCESS)

Feb 15 17:06:00 lastline systemd[1]: Starting analyst-scheduler-data-retention-all.service...
Feb 15 17:06:17 lastline systemd[1]: Started analyst-scheduler-data-retention-all.service.


Tracking the progress of data being deleted:

To confirm the data retention service is cleaning up files, look at /var/log/analyst-scheduler/analyst-scheduler-data-retention-all.log. The log shows when the job starts:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Running Task Enforce Analyst Data Retention Policies... (uuid 5ca57491-7f8b-4e25-a4b8-dc9e79c51a0b)

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Connected to analyst db on host docker-host(3306) strict=on, compress=off

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Using resume data from /data_retention_resume_ts

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Obtaining resume-date from /data_retention_resume_ts

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Initializing syslog file-change publisher

The new time range for storage, based on the retention settings configured:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Enforcing data retention policies for Analyst data in 2022-02-15 16:50:12 - 2022-02-15 17:06:07

And the files to be deleted along with the status of every task:

Feb 15 17:06:07 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 19 screenshot result files to be removed for range 2021-11-17 16:50:12 - 2021-11-17 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 19 files (0.67 MB) for screenshot in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 2 upload files to be removed for range 2021-11-16 16:50:12 - 2021-11-16 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 2 files (0.59 MB) for upload in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 2 llurl_framework_trace result files to be removed for range 2022-01-15 16:50:12 - 2022-01-15 16:55:12

Feb 15 17:06:09 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 2 files (0.00 MB) for llurl_framework_trace in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 13 files (1.26 MB) for analysis_results in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - 4 traffic_capture result files to be removed for range 2021-12-17 16:50:12 - 2021-12-17 16:55:12

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Deleted 4 files (1.12 MB) for traffic_capture in 2022-02-15 16:50:12 - 2022-02-15 16:55:12

Periodically, we will see the service update the data retention resume timestamp:

Feb 15 17:06:10 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Updating resume-file /data_retention_resume_ts to 2022-02-15 16:55:12

When the job finishes, we will see this output:

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Running Task Enforce Analyst Data Retention Policies DONE (uuid 5ca57491-7f8b-4e25-a4b8-dc9e79c51a0b)

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Logging Timers

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - Timer global:

Feb 15 17:06:15 lastline analyst-scheduler_analyst-scheduler-data-retention-all_1[9066]: analyst_data_retention - INFO - global: 7.97
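To follow the cleanup in real time while the job is running, standard tools pointed at the same log file are sufficient:

# Follow the retention log and surface per-type deletions and job completion
tail -f /var/log/analyst-scheduler/analyst-scheduler-data-retention-all.log | grep -E 'Deleted|DONE'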

After that, we can run analyst_scheduler_data_usage.py --log-dir /var/log again to check storage usage by artifact type.
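To compare usage before and after the cleanup, one option is to capture the script output to files and diff them (the file names are just examples, and this assumes the output of the earlier run was saved the same way):

analyst_scheduler_data_usage.py --log-dir /var/log > /tmp/data_usage_after.txt 2>&1
diff /tmp/data_usage_before.txt /tmp/data_usage_after.txt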

Notes:

  • The additional retention settings in the UI apply to what the UI is able to display.
  • Do not run du while troubleshooting disk space issues on /data. There are a lot of files in this partition and the manager is very IO sensitive.
  • If the above does not resolve your issue, please open a case at https://customerconnect.vmware.com/group/vmware/get-help, referring to “VMware Technical Support”, and include your appliance details (license, appliance UUID).

 

/var full

The /var partition does not contain anywhere near the same volume of files as /data, so running du is less impactful, but we should still be mindful of the IO load it can cause. However, there is no set process for cleaning up the files stored here. Please run the following commands and open a ticket at https://customerconnect.vmware.com/group/vmware/get-help, referring to “VMware Technical Support”, and include the command output along with your appliance details (license, appliance UUID).
 

sudo ionice -c 3 du -xah --time --max-depth=4 /var/ | sort | grep G

sudo lsof +L1
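The du command highlights gigabyte-sized directories and files under /var while running at idle IO priority (ionice -c 3), and lsof +L1 lists files that have been deleted but are still held open by a running process, which can consume space without appearing in du output. If it helps to attach the results to the ticket, the output can be redirected to files (the paths below are just examples):

sudo ionice -c 3 du -xah --time --max-depth=4 /var/ | sort | grep G > /tmp/var_du.txt
sudo lsof +L1 > /tmp/var_lsof.txt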