Ingestion tasks fail in Druid in NSX-T

Article ID: 325076


Updated On:

Products

VMware NSX Networking

Issue/Introduction

This article provides information on how to resolve an issue where the HDFS disk is full, which causes Druid to fail to ingest new data. You may also observe issues in visualization and recommendation.

Symptoms:
  • Ingestion tasks fail in Druid.
  • New config or flow data does not show up in visualization/recommendation.


Environment

VMware NSX-T Data Center

Cause

This issue occurs because the HDFS disk is full.

Resolution

This is a known issue affecting VMware NSX Intelligence 1.1.x and 1.2.0.

Currently, there is no resolution.

Workaround:
To work around this issue:
  1. Run these commands to confirm if you have the issue:

    /opt/apache-hadoop/bin/hdfs dfsadmin -report
    /opt/apache-hadoop/bin/hdfs dfs -du /druid

    Note: If "DFS Used" from the first command is equal to the present capacity (For example: 300GB) then HDFS is full. You should expand the disk or delete some logs.

    If the second command shows that the indexing logs are consuming a large amount of space, continue with step 2.
     
  2. Use the "/opt/apache-hadoop/bin/hdfs dfs -rm -r /druid/indexing-logs/*" command to delete all indexing logs.

    Note: Wait a while and check if DFS usage has gone down.
     
  3. Druid supervisors should automatically recover in a few hours. You can use the systemctl restart configure-druid command to speed up the recovery.

    a. Wait a while and use "curl -X GET http://localhost:8090/druid/indexer/v1/supervisor?state=true" to check if all supervisors are in "RUNNING" state.

    b. If any supervisor is not, use "curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisorId>/reset" to reset it (a scripted version of this check is sketched after these steps).

    Note: If the setup was upgraded from 1.0.x, the following supervisors may exist. They do not need to be reset and can instead be terminated with:

    "curl -X POST http://localhost:8090/druid/indexer/v1/supervisor/<supervisorId>/terminate"

    where <supervisorId> is one of:

    ['pace2druid_manager_dfw_rule_config', 'pace2druid_manager_nsgroup_config', 'pace2druid_manager_vm_config', 'pace2druid_policy_dfw_rule_config', 'pace2druid_policy_group_config', 'pace2druid_policy_service_config', 'pace2druid_policy_service_entry_config']
     
  4. Use the systemctl restart nsx-config command to get the latest config objects from NSX.
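
For reference, the disk-usage check from step 1 can be scripted. The following is a minimal sketch, not an official tool: it assumes the report contains the "DFS Used" and "Present Capacity" lines referenced above, and the 95% threshold is only an illustrative choice.

    # Minimal sketch: parse "hdfs dfsadmin -report" and warn when DFS usage
    # approaches the present capacity.
    import shlex
    import subprocess

    HDFS_BIN = "/opt/apache-hadoop/bin/hdfs"
    USAGE_THRESHOLD = 0.95  # illustrative threshold, not an official limit

    report = subprocess.check_output(
        shlex.split("%s dfsadmin -report" % HDFS_BIN), universal_newlines=True)

    stats = {}
    for line in report.splitlines():
        # The cluster-wide summary contains lines such as
        # "Present Capacity: 322122547200 (300 GB)"; keep only the first
        # occurrence of each key, since per-datanode sections repeat "DFS Used:".
        if line.startswith("Present Capacity:") or line.startswith("DFS Used:"):
            key, value = line.split(":", 1)
            stats.setdefault(key, int(value.split()[0]))

    used, capacity = stats["DFS Used"], stats["Present Capacity"]
    if used >= USAGE_THRESHOLD * capacity:
        print("HDFS is (almost) full: %d of %d bytes used" % (used, capacity))
    else:
        print("HDFS usage looks healthy: %d of %d bytes used" % (used, capacity))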

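The supervisor check from step 3 can be automated in the same way. This is a minimal sketch against the overlord API used above; it assumes the ?state=true response is a JSON list of objects carrying "id" and "state" fields, and it only prints the reset command from step 3b instead of issuing it.

    # Minimal sketch: list Druid supervisors that are not in RUNNING state.
    import json
    import shlex
    import subprocess

    OVERLORD = "http://localhost:8090/druid/indexer/v1/supervisor"

    out = subprocess.check_output(
        shlex.split('curl -s -X GET "%s?state=true"' % OVERLORD),
        universal_newlines=True)

    for sup in json.loads(out):
        if sup.get("state") != "RUNNING":
            # Print, rather than run, the reset call from step 3b.
            print("%s is %s; reset with: curl -X POST %s/%s/reset"
                  % (sup["id"], sup.get("state"), OVERLORD, sup["id"]))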

Additional Information

Aside from the steps mentioned in the Workaround section, VMware recommends that all users proactively do the following:
  1. Run vim /opt/vmware/pace/config/remove_hdfs_indexing_logs.py.
     
  2. Replace the content of the file with the content of the attached "remove_hdfs_indexing_logs.py" script (reproduced below):

    import datetime
    import os
    import shlex
    import subprocess

    INDEX_DIRS = ("/druid/indexing-logs",)    # HDFS directories to clean up
    HDFS_BIN = "/opt/apache-hadoop/bin/hdfs"  # path to the hdfs CLI
    TIME_NOW = datetime.datetime.now()
    HDFS_DATE_FMT = "%Y-%m-%d %H:%M"          # timestamp format in "hdfs dfs -ls" output
    N_MINS_OLD = 10                           # directories older than this many minutes are removed
    ENTRY_IN_ONE_COMMAND = 100                # max number of paths per "hdfs dfs -rm" invocation

    to_remove_dirs = []
    for index_dir in INDEX_DIRS:
        try:
            out = subprocess.check_output(shlex.split("%s dfs -ls %s" % (HDFS_BIN, index_dir)),
                                          stderr=subprocess.STDOUT).strip()
        except subprocess.CalledProcessError as cpe:
            if "No such file or directory" in cpe.output:
                continue
            raise
        if out:
            for line in out.splitlines()[1:]:
                observed_date = " ".join(line.split()[5:7])
                dir_age = (TIME_NOW - datetime.datetime.strptime(observed_date, HDFS_DATE_FMT)).total_seconds() // 60
                dir_name = os.path.basename(line.split()[-1])
                # If the directory has been there for longer than 10 mins then it is a candidate for removal.
                if dir_age >= N_MINS_OLD:
                    to_remove_dirs.append(line.split()[-1])

    if to_remove_dirs:
        # sliding window deletion to avoid too-long-argument exception
        entry_num = len(to_remove_dirs)
        st_index = 0
        while st_index < entry_num:
            subprocess.check_output(shlex.split("%s dfs -rm -R %s" % (HDFS_BIN, " ".join(to_remove_dirs[st_index:st_index + ENTRY_IN_ONE_COMMAND]))))
            st_index += ENTRY_IN_ONE_COMMAND

        
  3. Save and Exit.
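
The effect of the updated script can be checked later by re-running /opt/apache-hadoop/bin/hdfs dfs -du /druid (from step 1 of the workaround) and confirming that the space used by /druid/indexing-logs stays bounded as indexing-log directories older than roughly 10 minutes are removed.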