On SSP 5.1, you may observe a "Flow Storage Full" alarm along with "Platform Disk Usage High/Very High" alarms for MinIO, as shown below. Specific MinIO pods may report nearly 100% disk usage.
The storage is full, and flow ingestion is either paused or the system has reduced the retention of older flows. This degrades the quality of analysis, which is then working with stale flows and/or a reduced flow history.
Recommended Action: Flow storage is full. Both worker nodes and components need to be scaled out. Please follow the KB for steps to reduce the number of flows, or scale out the deployment to meet the sizing requirements. You can find the current retention days and the predicted days until storage is full on the Metrics page by navigating to 'Platform & Services' under the 'System' tab and choosing 'Metrics'.
The disk usage of Security Services Platform component minio/data-minio-7 is currently 98.08%, which is above the threshold value.
Recommended Action: See if files on the respective disk can be cleaned up. Follow the instructions in the KB to remediate the issue.
SSP 5.1.0
This issue is caused by Druid intermediate shuffle data not being automatically cleaned up from MinIO storage.
Druid uses MinIO to store intermediate data during ingestion tasks (shuffle data). In some scenarios, these temporary files, located under "/data/minio/druid/druid/segments/shuffle-data", are not deleted after the task completes. Over time, the stale files accumulate and consume all of the storage allocated to the MinIO pods.
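For illustration, listing the contents of that path inside one of the MinIO pods (the pod name minio-0 and the availability of ls in the container are assumptions; the confirmation steps below walk through identifying the pods) typically shows per-task shuffle directories whose timestamps long predate any running ingestion task:
# Illustrative only: list shuffle-data entries, newest first
k -n nsxi-platform exec minio-0 -- ls -lt /data/minio/druid/druid/segments/shuffle-data | head -n 20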
To confirm this issue:
Check for "Flow Storage Full" alarms in the SSP UI.
Log in to the SSPI and check the disk usage of the MinIO pods.
# Get MinIO pods
k -n nsxi-platform get pods -l app.kubernetes.io/name=minio
# Exec into a MinIO pod (e.g., minio-0) and check disk usage
k -n nsxi-platform exec minio-0 -- df -H /data/minio
# If /data/minio is near 100% usage, check the size of the shuffle-data directory
k -n nsxi-platform exec minio-0 -- du -sh /data/minio/druid/druid/segments/shuffle-data
# If this directory is significantly large (e.g., tens of GBs), this issue is present.
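Optionally, you can gauge how much of the shuffle data is stale by listing entries older than 24 hours, the age targeted by the workaround below. This is a minimal sketch and assumes the find utility is present in the MinIO container.
# List shuffle-data entries not modified within the last 24 hours (1440 minutes)
k -n nsxi-platform exec minio-0 -- find /data/minio/druid/druid/segments/shuffle-data -mindepth 1 -maxdepth 1 -mmin +1440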
To resolve this issue, apply a workaround script that periodically cleans up stale shuffle data older than 24 hours.
Download the patch script:
Save the script to a file named "patch_druid_cleanup.sh" on the SSPI.
Make the script executable:
chmod +x patch_druid_cleanup.sh
You can run the script with the --dry-run flag first to verify the changes it will make:
./patch_druid_cleanup.sh --dry-run
Then apply the changes:
./patch_druid_cleanup.sh
You may need to specify the kubeconfig location using "--kubeconfig <path>". For example:
./patch_druid_cleanup.sh --kubeconfig /config/clusterctl/1/workload.kubeconfig
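After applying the patch, you can verify that the cleanup CronJob exists and is scheduled. The CronJob name used here is assumed from the manual-trigger example that follows.
# Verify the cleanup CronJob was created (name assumed from the example below)
k -n nsxi-platform get cronjob check-druid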
The script output provides a command to manually trigger the cleanup job. You can run that command to verify immediately that the cleanup works.
# Example command:
k -n nsxi-platform create job manual-cleanup-test --from=cronjob/check-druid
# Check the logs of the manual job:
# Find the pod name
k -n nsxi-platform get pods -l job-name=manual-cleanup-test
# Check logs
k -n nsxi-platform logs <pod-name-from-above>
# You should see logs indicating "Scanning ... for items older than 24 hours" and confirming deletion.
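Finally, once the cleanup job has completed, re-run the disk usage check from the confirmation steps above; usage on /data/minio should drop back below the alarm threshold.
# Re-check MinIO disk usage after cleanup
k -n nsxi-platform exec minio-0 -- df -H /data/minio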