Flow Storage Full alarm and Minio Storage Full alarm in SSP 5.1 due to stale Druid shuffle data

Article ID: 424450


Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

SSP 5.1 may raise a "Flow Storage Full" alarm along with a "Platform Disk Usage High/Very High" alarm for MinIO, as shown below. Specific MinIO pods may report nearly 100% disk usage.

The storage is full and the ingestion of flows is paused or the system has reduced the retention of older flows. This impacts the quality of analysis as it is working with stale flows and/or reduced historical flows.

Recommended Action: Flow storage is full. Both worker nodes and components need to be scaled out. Please follow the KB for steps to reduce the number of flows, or scale out the deployment to meet the sizing requirements. You can find current values of retention days and predicted full days in metrics page by navigating to 'Platform & Services' under the 'System' tab and choosing 'Metrics'.

The disk usage of Security Services Platform component minio/data-minio-7 is currently 98.08%, which is above the threshold value. 
Recommended Action: See if files on the respective disk can be cleaned up. Follow the instructions in the KB to remediate the issue.

 

Environment

SSP 5.1.0

Cause

The issue is caused by Druid intermediate shuffle data not being automatically cleaned up from MinIO storage.

Druid uses MinIO to store intermediate data during ingestion tasks (shuffle data). In some scenarios, these temporary files located at "/data/minio/druid/druid/segments/shuffle-data" are not deleted after the task completes. Over time, these stale files accumulate, consuming all available storage allocated to the MinIO pods.

To confirm this issue:

Check for "Flow Storage Full" alarms in the SSP UI.
Log in to the SSPI and check the disk usage of the MinIO pods.

# Get MinIO pods

k -n nsxi-platform get pods -l app.kubernetes.io/name=minio

# Exec into a MinIO pod (e.g., minio-0) and check disk usage

k -n nsxi-platform exec minio-0 -- df -H /data/minio

# If /data/minio is near 100% usage, check the size of the shuffle-data directory
# If this directory is significantly large (e.g., tens of GBs), this issue is present

k -n nsxi-platform exec minio-0 -- du -sh /data/minio/druid/druid/segments/shuffle-data
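If multiple MinIO pods are deployed, the same checks can be repeated across all of them. The loop below is a convenience sketch built from the commands above; it assumes the "k" shortcut for kubectl and the pod label used earlier in this article (if "k" is only a shell alias, substitute kubectl).

# Check disk usage and shuffle-data size on every MinIO pod (illustrative loop)

for pod in $(k -n nsxi-platform get pods -l app.kubernetes.io/name=minio -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  k -n nsxi-platform exec "$pod" -- df -H /data/minio
  k -n nsxi-platform exec "$pod" -- du -sh /data/minio/druid/druid/segments/shuffle-data
done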

Resolution

To resolve this issue, apply a workaround script that periodically cleans up stale shuffle data older than 24 hours.
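The scheduled job the script sets up targets entries under the shuffle-data path that are older than 24 hours. As a rough way to preview what would qualify, the command below lists such entries; this is only an illustrative sketch (it assumes the find utility is available inside the MinIO container), and the actual cleanup logic lives in the attached script.

# List shuffle-data entries older than 24 hours (1440 minutes) - the candidates for cleanup (illustrative)

k -n nsxi-platform exec minio-0 -- find /data/minio/druid/druid/segments/shuffle-data -mindepth 1 -maxdepth 1 -mmin +1440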

Download the attached patch script (see Attachments below) and save it to a file named "patch_druid_cleanup.sh" on the SSPI.

Make the script executable:

chmod +x patch_druid_cleanup.sh

You can run the script with a dry-run flag first to verify:

./patch_druid_cleanup.sh --dry-run

Then apply the changes:

./patch_druid_cleanup.sh

You may need to specify the kubeconfig location using "--kubeconfig <path>".

./patch_druid_cleanup.sh --kubeconfig /config/clusterctl/1/workload.kubeconfig
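Once the script has been applied, you can confirm that the scheduled cleanup job exists; the CronJob name below is taken from the example command later in this section, so adjust it if the script output reports a different name.

# Confirm the cleanup CronJob is present and view its schedule (name assumed from the example below)

k -n nsxi-platform get cronjob check-druid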

The script output provides a command to manually trigger the cleanup job. You can run that command to verify that the cleanup works immediately.

# Example command

k -n nsxi-platform create job --from=cronjob/check-druid manual-cleanup-test

# Check the logs of the manual job:
# Find the pod name

k -n nsxi-platform get pods -l job-name=manual-cleanup-test    

# Check logs

k -n nsxi-platform logs <pod-name-from-above>

# You should see logs indicating "Scanning ... for items older than 24 hours" and confirming deletion.
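Optionally, wait for the manual job to complete and then re-check disk usage on the affected MinIO pod to confirm that space was reclaimed. The commands below reuse the job and pod names from the steps above.

# Wait for the manual cleanup job to finish (up to 5 minutes), then re-check MinIO disk usage

k -n nsxi-platform wait --for=condition=complete job/manual-cleanup-test --timeout=300s

k -n nsxi-platform exec minio-0 -- df -H /data/minio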

Attachments

patch_druid_cleanup.sh