Symptoms:
High persistent storage usage can have several causes, such as:
- Customer has high data volume
- NAPP Cluster has infra issues
- Repeated reindexing failures
The repeated reindexing failures might be due to a known issue outlined in the Cause section. The optional diagnostic steps are recommended only for advanced users who want to confirm that this issue is the cause. Regardless of the diagnosis, we suggest that all users apply the patch in the Workaround section, either to fix the issue or to prevent it from happening. The fix will be incorporated in the next release.
------------------------------------ Optional steps-------------------------------------------
These steps help pinpoint the cause but are not required. Users who do not need the diagnosis can skip them and go directly to the Workaround section.
On the manager host, use this command to log in to the coordinator pod:
napp-k exec -it $(napp-k get pods | grep druid-coordinator | awk '{print $1}') -- bash
Use this command to list the last 10 'index' task statuses, and format the output with a JSON viewer of your choice. The output is a list of task JSON objects; check the status of each task. If the tasks are failing repeatedly, the high disk usage is caused by existing data not being compacted.
curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10'
Sample output:
[{
  "id": "index_correlated_flow_oaffjogi_2024-02-26T19:00:06.973Z",
  "groupId": "index_correlated_flow_oaffjogi_2024-02-26T19:00:06.973Z",
  "type": "index",
  "createdTime": "2024-02-26T19:00:06.974Z",
  "queueInsertionTime": "1970-01-01T00:00:00.000Z",
  "statusCode": "SUCCESS",
  "status": "SUCCESS",
  "runnerStatusCode": "NONE",
  "duration": 10257,
  "location": {"host": "192.168.145.5", "port": -1, "tlsPort": 8104},
  "dataSource": "correlated_flow",
  "errorMsg": null
}]
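As an alternative to reading the list by eye, the status check can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, it assumes the curl output was saved or piped as a string, and it assumes failed tasks carry the value "FAILED" in the 'statusCode' (or 'status') field, which the sample above does not show.

```python
import json
from collections import Counter
from typing import List


def _status(task: dict) -> str:
    # Prefer "statusCode"; fall back to "status" if it is absent.
    return task.get("statusCode") or task.get("status") or "UNKNOWN"


def summarize_tasks(tasks_json: str) -> Counter:
    """Count tasks per status in the JSON list returned by the tasks endpoint."""
    return Counter(_status(t) for t in json.loads(tasks_json))


def failed_task_ids(tasks_json: str) -> List[str]:
    """Return the ids of failed tasks, newest first by 'createdTime'."""
    # Assumption: a failed task is reported with the status "FAILED".
    failed = [t for t in json.loads(tasks_json) if _status(t) == "FAILED"]
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    failed.sort(key=lambda t: t.get("createdTime", ""), reverse=True)
    return [t["id"] for t in failed]
```

If most of the last 10 tasks are counted as failed, the ids returned by `failed_task_ids` are candidates for the payload check in the next step.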
Find a recent failed task in the list based on the 'createdTime' field, and use the following command to get the payload of the task. Replace "<taskId>" with the 'id' field value from the previous result. A JSON viewer can be used to parse the payload, as in the sample below.
If "granularitySpec.segmentGranularity" is "DAY", the expected "ioConfig.inputSource.interval" spans 1 day. Likewise, if "granularitySpec.segmentGranularity" is "WEEK", the expected "ioConfig.inputSource.interval" spans 7 days. An interval longer than the expected value means the reindexing task is consuming too much data at a time and is likely to fail.
Note: If "granularitySpec.segmentGranularity" is a value other than "DAY" or "WEEK", check the payload of another "<taskId>" until a daily or weekly task is found.
curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/task/<taskId>'
Sample output:
{
  "task": "index_parallel_correlated_flow_viz_limajgjp_2024-03-12T06:50:57.945Z",
  "payload": {
    ....
    ....
    "spec": {
      "dataSchema": {
        "dataSource":...
        "timestampSpec":...
        ...
        "granularitySpec": {
          "type": "uniform",
          -------------------------------------------
          "segmentGranularity": "DAY",
          -------------------------------------------
          "queryGranularity": "DAY",
          "rollup": true,
          "intervals": []
        },
        "transformSpec":...
      },
      "ioConfig": {
        "type": "index_parallel",
        "inputSource": {
          "type": "druid",
          "dataSource": "correlated_flow_viz",
          -------------------------------------------
          "interval": "2024-03-10T00:00:00.000Z/2024-03-11T00:00:00.000Z"
          -------------------------------------------
        },
        "inputFormat": null,
        "appendToExisting": false,
        "dropExisting": false
      },
      ....
      ....
    }
  }
}
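The interval comparison described above can also be done mechanically. The following is a minimal sketch, assuming the two values have already been copied out of the payload and that timestamps use the format shown in the sample (for example 2024-03-10T00:00:00.000Z). The function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# Expected interval length per segment granularity, as described in this article.
EXPECTED = {"DAY": timedelta(days=1), "WEEK": timedelta(days=7)}


def parse_ts(ts: str) -> datetime:
    # Assumes the timestamp format seen in the sample payload.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def interval_length(interval: str) -> timedelta:
    """Length of an interval written as '<start>/<end>'."""
    start, end = interval.split("/")
    return parse_ts(end) - parse_ts(start)


def is_oversized(segment_granularity: str, interval: str) -> Optional[bool]:
    """True if the interval is longer than expected, False if not,
    None for granularities other than DAY and WEEK."""
    expected = EXPECTED.get(segment_granularity)
    if expected is None:
        return None
    return interval_length(interval) > expected
```

For the sample payload, `is_oversized("DAY", "2024-03-10T00:00:00.000Z/2024-03-11T00:00:00.000Z")` evaluates to `False`, because the interval spans exactly one day.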
Use the following commands to check the retention rules.
1. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow
2. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_viz
3. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_rec
The expected output is:
[{"period": "P30D", "includeFuture": true, "tieredReplicants": {"_default_tier": 2}, "useDefaultTierForNull": true, "type": "loadByPeriod"}, {"type": "dropForever"}]
If the output differs, the retention rule was dropped or not applied correctly, so data older than 30 days is being kept.
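The comparison against the expected rules can be sketched as a small check. This is an illustration only, with a hypothetical function name. It looks at the rule types and the period and deliberately ignores the replica count, since the workaround in this article changes that value from 2 to 1.

```python
import json


def retention_ok(rules_json: str, period: str = "P30D") -> bool:
    """True if the rules are 'loadByPeriod' with the given period,
    followed by 'dropForever'."""
    rules = json.loads(rules_json)
    if len(rules) != 2:
        return False
    load, drop = rules
    return (
        load.get("type") == "loadByPeriod"
        and load.get("period") == period
        and drop.get("type") == "dropForever"
    )
```

An empty list, which is what a datasource without its own rules would return under this sketch's assumption, makes the check return `False`.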
Cause:
The reindexing jobs aim to merge non-unique flows with increased granularity, effectively reducing the overall segment size. When these reindexing jobs fail, an excessive number of segments accumulates, leading to storage issues on the historical nodes.
These jobs typically process data spanning a day or a week. They use the timestamp of the last successful reindexing job as the starting point for calculating the interval. Occasionally, an infrastructure issue causes the job for the current week to fail. In that case, the next week's job attempts to cover two weeks' worth of data, because the timestamp of the last successful job has not advanced. This expanded reindexing task often runs out of storage in the middle manager because it attempts to process too much data. The result is a vicious cycle in which subsequent reindexing attempts keep failing as each job's interval grows larger. Moreover, when the Druid middle manager reads segments for reindexing, the intermediate files are not cleaned up after a job failure, which further aggravates the out-of-storage issue.
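The growth of the interval can be illustrated with a toy model. This is not the product's scheduling code: the function name is hypothetical, and `max_weeks_per_job` is an assumed stand-in for the amount of data the middle manager can process before it runs out of storage.

```python
from typing import List, Tuple


def simulate(weeks: int, infra_failure_week: int,
             max_weeks_per_job: int = 1) -> List[Tuple[int, int, bool]]:
    """Return (week, interval_in_weeks, succeeded) for each weekly job."""
    last_success = 0  # week index of the last successful job
    history = []
    for week in range(1, weeks + 1):
        # The job covers everything since the last successful job.
        span = week - last_success
        # It fails on the infrastructure incident, or when the span is too large.
        ok = week != infra_failure_week and span <= max_weeks_per_job
        if ok:
            last_success = week
        history.append((week, span, ok))
    return history
```

With `simulate(5, infra_failure_week=2)`, week 1 succeeds, week 2 fails because of the incident, and weeks 3 to 5 fail with intervals of 2, 3, and 4 weeks, which is the cycle described above.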
Druid retains most datasources for 30 days. Sometimes the 30-day retention rules are not applied successfully, and Druid then does not purge data that is more than 30 days old. Additionally, Druid keeps 2 copies of cached data for queries, which increases availability at the cost of double the storage. This value can be decreased from 2 to 1 without permanently losing any data. If a data server goes down, which is rare, recovering the cached data takes only a couple of minutes.
The resolution will be implemented in the next minor release version. For earlier versions, users can execute the patch below to achieve a similar remedy.
Workaround:
On the manager host, use this command to log in to the coordinator pod:
napp-k exec -it $(napp-k get pods | grep druid-coordinator | awk '{print $1}') -- bash
While still logged in to the Coordinator pod, run the following commands to set or update the retention rules and to change the replica count in the retention rules from 2 to 1.
1. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
2. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_viz -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
3. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_rec -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
Verify the integrity of the patch file by checking its MD5 sum.
md5sum patch.tar.gz
Expected value: 2809bd0c7be19a24e7ef736b76a5251a
Extract the tar.
tar xvf patch.tar.gz
Get into the folder and make the patch script executable.
cd workaroundForReindexingFailure
chmod +x patch.sh
Optional: if you are on the manager host and use the "napp-k" alias, you can use either of these commands to find the "kubeconfig" file path.
alias | grep "napp-k"
OR
alias
Expected output:
napp-k='kubectl --kubeconfig=<YourKubeconfigFilePath> -n nsxi-platform'
Execute the script.
./patch.sh --kubeconfig=YourKubeconfigFilePath
Wait 10-30 minutes for the Historicals to drop cached data, since the replica value was changed from 2 to 1. Then return to the terminal session that is logged in to the Coordinator pod and use the following command to compare 'currSize' with 'maxSize'. The 'currSize' value should keep decreasing until it stabilizes.
curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/coordinator/v1/servers?simple'
Sample output:
[{
  "host": "localhost:8083",
  "tier": "_default_tier",
  "type": "historical",
  "priority": 0,
  "currSize": 991588,
  "maxSize": 300000000000
}]
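To make the trend easier to follow across repeated runs, the two fields can be reduced to a usage percentage per Historical. The following is a sketch with a hypothetical function name, assuming the output has the shape of the sample above.

```python
import json
from typing import Dict


def historical_usage(servers_json: str) -> Dict[str, float]:
    """Map each historical host to its usage in percent (currSize / maxSize)."""
    usage = {}
    for server in json.loads(servers_json):
        # Skip non-historical servers and entries without a usable maxSize.
        if server.get("type") != "historical" or not server.get("maxSize"):
            continue
        usage[server["host"]] = 100.0 * server["currSize"] / server["maxSize"]
    return usage
```

Recording the returned percentages every few minutes shows whether the value is still falling or has stabilized.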
Keep monitoring the status of the reindexing tasks for up to 4 weeks using this command in the Coordinator pod. The daily and weekly reindexing tasks should start to succeed over the following days and weeks.
curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10'