Analyzing Druid Task Failures in the Metrics Tab

Article ID: 389041


Products

VMware vDefend Firewall, VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

When the flow rate is too high, or the metadata carried in each flow is too large (for example, when there are too many groups), the total size of the flow data can exceed the input threshold used by the compaction job. As a result, the compaction jobs will not run.

Without compaction, flow storage usage grows very quickly. This may cause the daily or weekly reindexing tasks to fail, eventually leading to even faster growth of the flow storage.
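To gauge how much the flow datasource has grown, you can optionally query the Druid Coordinator through the router. This is a sketch that assumes the same druid-router service and port used in the Resolution section; the endpoint is the standard Druid Coordinator datasource API, and its response includes the total segment size and segment count.

# Optional: log into the druid-router pod and check the size of the flow datasource
k -n nsxi-platform exec -it svc/druid-router -- bash

curl 'https://druid-router:8280/druid/coordinator/v1/datasources/correlated_flow_viz' --insecure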

Verification:

Log in to the SSP UI and navigate to System > Platform & Features > Metrics, then scroll to the "Druid Task Failures" panel. You may see failures for Flow Visualization - Index Parallel and Flow Recommendation - Index Parallel.
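The same failures can also be checked from the command line through the Druid task API. This is an optional sketch, assuming the druid-router service and port shown in the Resolution section; look for entries with a FAILED status in the response.

# Optional: list completed ingestion tasks for the flow datasource from inside the druid-router pod
k -n nsxi-platform exec -it svc/druid-router -- bash

curl 'https://druid-router:8280/druid/indexer/v1/tasks?state=complete&datasource=correlated_flow_viz' --insecure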

Environment

vDefend SSP 5.0.0

Cause

A high flow rate and large flow size cause the compaction and reindexing jobs to fail.

Resolution

Increase Compaction Job Parameters:

Increase the compaction job's input threshold, max heap size, and concurrent task count for the correlated_flow_viz table.

 

# Log into the druid-router pod
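# Note: 'k' is used here as shorthand for kubectl; substitute 'kubectl' if the alias is not defined in your shell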

k -n nsxi-platform exec -it svc/druid-router -- bash


# Post the updated compaction spec

# Note that the value for inputSegmentSizeBytes has increased from 3000000000 to 4000000000 and maxNumConcurrentSubTasks has increased from 1 to 2

curl 'https://druid-router:8280/druid/coordinator/v1/config/compaction' \

-H 'Content-Type: application/json' \

--data-raw '{"dataSource":"correlated_flow_viz","taskPriority":25,"inputSegmentSizeBytes":4000000000,"maxRowsPerSegment":null,"skipOffsetFromLatest":"PT1H30M","tuningConfig":{"maxRowsInMemory":500000,"appendableIndexSpec":null,"maxBytesInMemory":100000000,"maxTotalRows":null,"splitHintSpec":{"type":"maxSize","maxSplitSize":4294967296,"maxNumFiles":1000},"partitionsSpec":{"type":"dynamic","maxRowsPerSegment":5000000,"maxTotalRows":10000000},"indexSpec":null,"indexSpecForIntermediatePersists":null,"maxPendingPersists":null,"pushTimeout":null,"segmentWriteOutMediumFactory":null,"maxNumConcurrentSubTasks":2,"maxRetry":null,"taskStatusCheckPeriodMs":null,"chatHandlerTimeout":null,"chatHandlerNumRetries":null,"maxNumSegmentsToMerge":null,"totalNumMergeTasks":null,"maxColumnsToMerge":5000,"type":"index_parallel","forceGuaranteedRollup":false},"granularitySpec":null,"dimensionsSpec":null,"metricsSpec":null,"transformSpec":null,"ioConfig":null,"engine":null,"taskContext":{"druid.indexer.fork.property.druid.processing.buffer.sizeBytes":"128000000","druid.indexer.runner.javaOpts":"-Xms128M -Xmx1024M -XX:MaxDirectMemorySize=1G"}}' \

--insecure


# Verify the spec is updated

curl 'https://druid-router:8280/druid/coordinator/v1/config/compaction/correlated_flow_viz' --insecure
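After updating the spec, you can optionally confirm that auto-compaction is progressing for the datasource. This is a sketch using the standard Druid Coordinator compaction status endpoint; the exact fields returned may vary with the Druid version.

# Optional: check auto-compaction status for the datasource (run from the same druid-router pod)
curl 'https://druid-router:8280/druid/coordinator/v1/compaction/status?dataSource=correlated_flow_viz' --insecure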
 

Increase Storage Size for Druid Middle Manager:

Ensure the backend storage has enough free capacity and that the storage class allows volume expansion.

Increase each data-druid-middle-manager PVC from 16Gi to 32Gi.



# List all druid middle manager pods and all druid middle manager PVCs

k -n nsxi-platform get pods -l app.kubernetes.io/component=druid.middleManager

k -n nsxi-platform get pvc -l component=middle-manager



# For each PVC, increase the size from 16Gi to 32Gi

k -n nsxi-platform patch pvc <data-druid-middle-manager-X> -p '{"spec": {"resources": {"requests": {"storage": "32Gi"}}}}'



# Restart all druid middle manager pods by deleting them (they are recreated automatically)

k -n nsxi-platform delete pod <druid-middle-manager-0> <druid-middle-manager-1> ...



# Verify that all PVCs now show the increased size (32Gi)

k -n nsxi-platform get pvc -l component=middle-manager
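Optionally, confirm that the expanded capacity is also visible inside the restarted pods. This is a minimal check; the mount path used for the Druid data volume may vary in your deployment, so review the df output for the volume backed by the resized PVC.

# Optional: confirm the resized volume is reported with the new capacity inside a middle manager pod
k -n nsxi-platform exec -it <druid-middle-manager-0> -- df -h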

Additional Information