Some stale data may remain on the data storage disk for two reasons:
(1) Druid kill tasks are not scheduled frequently enough.
(2) The Druid overlord and coordinator pods may lose leadership during an upgrade and no longer create kill tasks to clean up the segments.
The "persistent storage" size under NSX Application Platform → Core services → Data Storage grows significantly after a Napp upgrade.
(1) Log in to the NSX Manager as root.
(2) Run the following command: napp-k get pods --selector=app.kubernetes.io/component=minio
(3) For each MinIO pod in the output above, run
napp-k exec -i <minio pod name> -- find /data/minio/druid/druid/segments/ -maxdepth 2 -mindepth 2
Output example:
/data/minio/druid/druid/segments/pace2druid_manager_realization_config/2024-07-25T00:00:00.000Z_2024-07-25T01:00:00.000Z
/data/minio/druid/druid/segments/pace2druid_manager_realization_config/2024-07-26T01:00:00.000Z_2024-07-26T02:00:00.000Z
/data/minio/druid/druid/segments/pace2druid_manager_realization_config/2024-07-24T01:00:00.000Z_2024-07-24T02:00:00.000Z
/data/minio/druid/druid/segments/pace2druid_policy_intent_config/2024-07-19T22:00:00.000Z_2024-07-19T23:00:00.000Z
/data/minio/druid/druid/segments/correlated_flow_viz/2024-07-30T05:00:00.000Z_2024-07-30T06:00:00.000Z
/data/minio/druid/druid/segments/correlated_flow_viz/2024-07-22T00:00:00.000Z_2024-07-23T00:00:00.000Z
(4) If there are segments under the "pace2druid_manager_realization_config", "pace2druid_policy_intent_config", "correlated_flow_viz", "correlated_flow", or "correlated_flow_rec" folders whose start dates are more than 37 days old, or whose end dates are more than 30 days old, then a clean up is required.
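The age check in step (4) can be scripted. The helper below is a hypothetical sketch (not part of this article's attached script): it reads segment paths, as printed by the find command above, on stdin and prints those whose interval end date is more than a given number of days old. It assumes GNU date is available.

```shell
# flag_old_segments [DAYS]
# Reads segment paths on stdin; prints those whose interval end
# timestamp is more than DAYS (default 30) days in the past.
flag_old_segments() {
  days=${1:-30}
  cutoff=$(date -u -d "$days days ago" +%s)
  while IFS= read -r path; do
    interval=${path##*/}   # e.g. 2024-07-25T00:00:00.000Z_2024-07-25T01:00:00.000Z
    end=${interval#*_}     # keep the interval end timestamp
    # Drop fractional seconds so GNU date can parse it; skip unparsable names.
    end_s=$(date -u -d "${end%%.*}Z" +%s 2>/dev/null) || continue
    if [ "$end_s" -lt "$cutoff" ]; then
      printf '%s\n' "$path"
    fi
  done
}
```

It can then be fed from step (3), for example: napp-k exec -i <minio pod name> -- find /data/minio/druid/druid/segments/ -maxdepth 2 -mindepth 2 | flag_old_segments 30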
The following steps are only required for Napp 4.1.2 and older versions:
(5) Run the following command: napp-k exec -i svc/druid-coordinator -- curl https://localhost:8281/druid/coordinator/v1/isLeader -sk
(6) If the response is {"leader":true}, then there is no concern. If the response is {"leader":false}, then the druid-coordinator pod needs to be restarted.
(7) Run the following command: napp-k exec -i svc/druid-overlord -- curl https://localhost:8290/druid/indexer/v1/isLeader -sk
(8) If the response is {"leader":true}, then there is no concern. If the response is {"leader":false}, then the druid-overlord pod needs to be restarted.
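Steps (5) through (8) can be combined into one check, sketched below: restart a Druid control pod only when it reports that it is not the leader. NAPPK is a hypothetical override so the helper can be exercised outside the NSX Manager; on the manager it defaults to napp-k. The overlord deployment name druid-overlord is an assumption.

```shell
NAPPK=${NAPPK:-napp-k}

# restart_if_not_leader NAME PORT API
# Queries https://localhost:PORT/druid/API/v1/isLeader inside the pod
# and restarts the deployment when the pod is not the leader.
restart_if_not_leader() {
  name=$1 port=$2 api=$3
  resp=$($NAPPK exec -i "svc/$name" -- curl -sk "https://localhost:$port/druid/$api/v1/isLeader")
  if [ "$resp" = '{"leader":true}' ]; then
    echo "$name is the leader, nothing to do"
  else
    $NAPPK rollout restart deployment "$name"
  fi
}
```

Usage, matching the ports and API paths in steps (5) and (7): restart_if_not_leader druid-coordinator 8281 coordinator, then restart_if_not_leader druid-overlord 8290 indexer.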
In Napp, compaction tasks compact smaller segments into larger ones to save space, while kill tasks remove compacted segments and older segments past retention. Kill tasks prioritize removing the more recent data, which is the compacted segments. When kill tasks are not scheduled frequently enough, older segments are left in the data storage (MinIO), and manual intervention is required to clean them up.
During the upgrade, the ZooKeeper pods restart, which may cause the Druid overlord and coordinator to lose leadership. Pods that do not hold leadership will not issue tasks to clean up data.
The issue is fixed in 4.2.0.
For versions before 4.2.0, customers need to run the clean up script attached to this article. They should also apply the following changes to increase the frequency of the kill tasks and avoid future manual cleaning.
(1) First edit the configmap for druid-coordinator:
"napp-k edit cm druid-coordinator-config"
(2) Find the field "druid.coordinator.kill.period" and set it to "PT600S" (ISO-8601 notation for 600 seconds, i.e. every 10 minutes).
(3) Add a new field "druid.coordinator.period.indexingPeriod" below "druid.coordinator.kill.period" and set it to "PT600S" as well. If the field already exists, change its value to "PT600S".
(4) Then restart coordinator pod:
"napp-k rollout restart deployment druid-coordinator"
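After steps (2) and (3), the kill-task portion of the coordinator runtime properties in the configmap should read roughly as follows (a sketch; surrounding keys are omitted, and the exact position within the configmap may differ):

```properties
# Run kill tasks, and the indexing duty cycle that schedules them, every 600 seconds
druid.coordinator.kill.period=PT600S
druid.coordinator.period.indexingPeriod=PT600S
```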