Reindexing failure of storage issue

Article ID: 345872


Updated On:

Products

VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:

  • You are running NSX Application Platform (NAPP) 3.2.1, 4.0.1 or 4.1.1.
  • The NSX-T Intelligence UI displays alarms indicating that persistent storage usage is high, with text such as: "The disk usage of Data Storage service is above the high threshold value of 75%."
  • Or, you received some other similar warning about high disk usage for NAPP.
  • A similar alert is seen in the NSX-T UI.


There can be multiple causes, or a combination of factors: users can legitimately have too much data, the cluster can have infrastructure issues, or repeated reindexing tasks can fail. There is a known issue with repeatedly failing reindexing tasks, which is explained in the Cause section. The diagnosis below is optional; we only recommend these steps to advanced users who want to confirm that they are hitting the issue described in the Cause section. However, because the patch has no side effects, we suggest that all users apply it, regardless of the diagnosis, to either fix the issue or prevent it from happening. The fix will be permanently incorporated in the next release.





------------------------------------ Optional steps-------------------------------------------

The following steps help pinpoint the cause. They are not necessary for all users; you can skip them and apply the workaround in section 4.1 directly. A condensed command-line sketch of these checks follows the list:

  • In the manager host, use this command to log in to the coordinator pod:

    napp-k exec -it $(napp-k get pods | grep druid-coordinator | awk '{print $1}') -- bash
  • Use this command to list the statuses of the last 10 'index' tasks, and use your favorite JSON viewer or an online JSON viewer to format the output. The result is a list of task JSON objects; check the status of each task. If the tasks are repeatedly failing, the disk-usage-high issue is caused by existing data not being compacted.

    curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10'
     
    sample output:
    [{"id":"index_correlated_flow_oaffjogi_2024-02-26T19:00:06.973Z","groupId":"index_correlated_flow_oaffjogi_2024-02-26T19:00:06.973Z","type":"index","createdTime":"2024-02-26T19:00:06.974Z","queueInsertionTime":"1970-01-01T00:00:00.000Z","statusCode":"SUCCESS","status":"SUCCESS","runnerStatusCode":"NONE","duration":10257,"location":{"host":"192.168.145.5","port":-1,"tlsPort":8104},"dataSource":"correlated_flow","errorMsg":null}]
  • Find any recently failed task based on the 'createdTime' field in the list, and use the following command to get the payload of the task. Replace :taskId with the 'id' field value from the previous result. You can use a JSON viewer to parse the payload, like the sample below. If "granularitySpec.segmentGranularity" is "DAY", the expected "ioConfig.inputSource.interval" is 1 day; likewise, if "granularitySpec.segmentGranularity" is "WEEK", the expected "ioConfig.inputSource.interval" is 7 days. If the interval is greater than the expected value, the reindexing task is consuming too much data at a time and is likely to fail. Note: if "granularitySpec.segmentGranularity" is a value other than "DAY" or "WEEK", check the payload of another taskId until a daily or weekly task is found.

    curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/task/:taskId'
     
    sample output:
    {
     "task": "index_parallel_correlated_flow_viz_limajgjp_2024-03-12T06:50:57.945Z",
     "payload": {
         ....
         ....
         "spec": {
           "dataSchema": {
             "dataSource": ...
             "timestampSpec": ...
             ...
             "granularitySpec": {
              "type": "uniform",
    -------------------------------------------
              "segmentGranularity": "DAY",
    -------------------------------------------
              "queryGranularity": "DAY",
              "rollup": true,
              "intervals": []
            },
            "transformSpec": ...
           },
           "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
              "type": "druid",
              "dataSource": "correlated_flow_viz",
    -------------------------------------------
              "interval": "2024-03-10T00:00:00.000Z/2024-03-11T00:00:00.000Z"
    -------------------------------------------
             },
            "inputFormat": null,
            "appendToExisting": false,
            "dropExisting": false
          },
          ....
          ....
         }
      }

    }
  • Use the following commands to check the retention rule. 

    1. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow
    2. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_viz
    3. curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_rec

    The expected output should be:

    [{"period":"P30D","includeFuture":true,"tieredReplicants":{"_default_tier":2},"useDefaultTierForNull":true,"type":"loadByPeriod"},{"type":"dropForever"}]

    If the output differs, the retention rule was dropped or not applied correctly, and data is being kept beyond 30 days.
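
Optionally, the checks above can be condensed into a short command-line sketch. It assumes the jq tool is available where you run it (jq may not be present in the Coordinator pod; if not, keep using a JSON viewer as described above), and <taskId> is a placeholder you must replace with an 'id' value returned by the first command:

    # List the last 10 'index' tasks with creation time and status; repeated
    # FAILED entries mean existing data is not being compacted.
    curl -sk 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10' \
      | jq -r '.[] | "\(.createdTime)  \(.status)  \(.id)"'

    # For a recently failed task, print its segment granularity and input interval.
    # DAY granularity with an interval much longer than 1 day (or WEEK with an
    # interval much longer than 7 days) points to the issue described in the Cause section.
    curl -sk 'https://localhost:8281/druid/indexer/v1/task/<taskId>' \
      | jq '{segmentGranularity: .payload.spec.dataSchema.granularitySpec.segmentGranularity, interval: .payload.spec.ioConfig.inputSource.interval}'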



Environment

VMware NSX-T Data Center 4.x
VMware NSX-T Data Center

Cause

The reindexing jobs aim to merge non-unique flows with increased granularity, effectively reducing the overall segment size. However, when these reindexing jobs fail, an excessive number of segments accumulates, leading to storage issues on the historical nodes. These jobs typically process a day's or a week's worth of data, using the timestamp of the last successful reindexing job as the starting point of the interval. Occasionally, infrastructure issues cause a job failure for the current week. In that case, the next week's job attempts to cover two weeks' worth of data, since the last successful job has not changed. This expanded reindexing task often causes out-of-storage failures in the middle manager because it attempts to process an excessive amount of data. The result is a vicious cycle in which subsequent reindexing attempts keep failing, as each job's interval becomes increasingly large. Moreover, when the Druid middle manager reads segments for reindexing, the intermediate files are left uncleaned after job failures, further aggravating the out-of-storage issue.
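
The snowballing interval can be illustrated with a minimal, purely illustrative shell sketch. The dates are hypothetical values chosen for the example, and GNU date is assumed:

    #!/bin/bash
    # Illustrative only: a weekly reindexing job covers the span from the last
    # SUCCESSFUL run to the current cycle. While jobs keep failing, the start
    # stays pinned, so each new attempt must process more data than the last.
    last_success="2024-03-03"            # hypothetical last successful weekly run
    for cycle in 1 2 3; do
      end=$(date -u -d "${last_success} + $((cycle * 7)) days" +%F)
      days=$(( ($(date -u -d "$end" +%s) - $(date -u -d "$last_success" +%s)) / 86400 ))
      echo "cycle ${cycle}: interval ${last_success}/${end} -> ${days} days of data"
    done
    # Prints 7, 14, 21 days: the growing interval eventually exceeds the middle
    # manager's storage and the task fails again.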

Druid retains most datasources for 30 days. Sometimes the 30-day retention rules are not applied successfully, and Druid then does not purge data more than 30 days old. Additionally, Druid keeps 2 copies of cached data for queries, which increases availability while doubling the storage used. We can decrease this from 2 to 1 without permanently losing any data; if a data server goes down (which is rare), it only takes a couple of minutes to recover the cached data.

 

Resolution

The resolution will be implemented in the next minor release version. For earlier versions, users can execute the patch below to achieve a similar remedy.

Workaround:

4.1.1.  Change datasource retention and replica.

  • In the manager host, use this command to log in to the coordinator pod:

    napp-k exec -it $(napp-k get pods | grep druid-coordinator | awk '{print $1}') -- bash
  • While still logged in to the Coordinator pod, apply the following commands to set or update the retention rules and to change the replica count from 2 to 1.

1. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow  -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
2. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_viz  -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
3. curl -XPOST -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow_rec  -d '[{"type" : "loadByPeriod", "period" : "P30D", "includeFuture" : true, "tieredReplicants": {"_default_tier" : 1}}, {"type" : "dropForever"}]'
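
Optional sanity check: you can read the rules back with the GET commands from the diagnosis section. This is only a sketch of the expected result; the exact fields Druid echoes back may vary slightly by version, but 'tieredReplicants' should now show 1:

    curl -XGET -H 'Content-Type: application/json' -k https://localhost:8281/druid/coordinator/v1/rules/correlated_flow

    expected output (similarly for correlated_flow_viz and correlated_flow_rec):
    [{"period":"P30D","includeFuture":true,"tieredReplicants":{"_default_tier":1},"useDefaultTierForNull":true,"type":"loadByPeriod"},{"type":"dropForever"}]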

 

 

4.1.2. Download the patch tar attached to this KB

4.1.3.  Without closing the existing connection to the Coordinator pod, open another SSH session to the same NSX Manager and apply the workaround using the following commands to reset the last successful reindexing task and to clean up the middle manager's stale data.

  1. Verify the integrity of the file by checking the MD5 sum.

    md5sum patch.tar.gz
     
    expected value: 2809bd0c7be19a24e7ef736b76a5251a
  2. Extract the tar.

    tar xvf patch.tar.gz
  3. Get into the folder and make the patch script executable.

    cd workaroundForReindexingFailure
    chmod +x patch.sh
  4. Optional: if you are on the manager host and use the "napp-k" alias, you can use this command to get the "kubeconfig" file path (a small sketch that scripts steps 4 and 5 follows this list).

    alias | grep "napp-k"
     
    OR
     
    alias
     
    expected output:
    napp-k='kubectl --kubeconfig=<YourKubeconfigFilePath> -n nsxi-platform'
  5. Execute the script.

    ./patch.sh --kubeconfig=<YourKubeconfigFilePath>
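
If you prefer to script steps 4 and 5, the following sketch extracts the kubeconfig path from the napp-k alias and passes it to the patch script. It assumes the alias is defined in your current shell in the form shown above and that the path contains no spaces:

    # Pull the kubeconfig path out of the napp-k alias definition, then run the patch.
    KUBECONFIG_PATH=$(alias napp-k 2>/dev/null | sed -n 's/.*--kubeconfig=\([^ ]*\).*/\1/p')
    ./patch.sh --kubeconfig="${KUBECONFIG_PATH}"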

4.1.4.  Keep an eye on the Historical load.

  1. Wait 10-30 minutes for the Historicals to drop cached data, since the replica value was changed from 2 to 1. Go back to the terminal session that is logged in to the Coordinator pod, and use the following command to monitor 'currSize' vs 'maxSize'; currSize should keep decreasing until it stabilizes.

    curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/coordinator/v1/servers?simple'
     
    Sample output:
    [{"host":"localhost:8083","tier":"_default_tier","type":"historical","priority":0,"currSize":991588,"maxSize":300000000000}]
  2. Keep monitoring the status of the reindexing tasks for up to 4 weeks, using this command in the Coordinator pod. The daily and weekly reindexing tasks should start to succeed in the following days and weeks.

    curl --insecure -XGET -H "Content-type: application/json" 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10'
  3. Keep monitoring the disk usage of the Historicals for up to 4 weeks, using the command from step 1 of section 4.1.4. 'currSize' should keep decreasing in the following days and weeks, as old data expires after the 30-day threshold and new data is compacted successfully by the reindexing tasks. A simple polling sketch follows this list.
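
The checks in steps 1-3 can also be left running as a simple polling loop inside the Coordinator pod. This is only a convenience sketch; the 10-minute interval is an arbitrary choice, and the raw JSON output can be pasted into a JSON viewer as described earlier:

    # Print historical capacity and the latest reindexing task statuses every 10 minutes.
    while true; do
      date -u
      curl -sk 'https://localhost:8281/druid/coordinator/v1/servers?simple'; echo
      curl -sk 'https://localhost:8281/druid/indexer/v1/tasks?type=index&max=10'; echo
      sleep 600
    done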



Attachments

patch.tar