Slowness in the UI loading or recommendation jobs failing due to growing Druid segments for correlated_flow_viz table

Article ID: 320974


Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • Druid compaction job failures for context tables in Druid log files.
  • Druid compaction job failures for correlated_flow or correlated_flow_viz in Druid log files.
  • Slow loading of the group view for time frames greater than one hour.

    Note: In some situations this symptom is not visible; however, segments are still growing in the background.


Environment

VMware NSX-T Data Center

Cause

The Druid database works on time chunks (hourly, weekly, etc.). For context data (endpoint features such as GI, etc.), the ingestion jobs work on WEEKLY chunks and lock that time period while they run. Hence, the entire period from Monday to Sunday is locked by the ingestion job. When a background job such as compaction kicks in, it tries to acquire the lock on a time chunk that overlaps the currently running ingestion jobs and terminates because it cannot get the lock. It then goes into retry mode and eventually ends up in a WAITING state. Available Druid compaction resources are limited, which can block compaction from running on certain data sources. However, the issue does not occur consistently; it depends on the ingestion load at the time the compaction jobs kick in.
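
To check whether compaction tasks are stuck in this state, the Druid Overlord task lists can be inspected. This assumes the Overlord/indexer endpoint is http://localhost:8090, the same endpoint used by the supervisor commands later in this article:

    curl http://localhost:8090/druid/indexer/v1/runningTasks
    curl http://localhost:8090/druid/indexer/v1/waitingTasks

Compaction tasks that repeatedly show up as waiting while the weekly ingestion tasks are running are consistent with the lock contention described above.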

During ingestion, if Druid receives timestamps that are not current (which may be caused by incorrect time synchronization on hosts or other issues), it locks those time periods to create new segments. At the same time, compaction and rollup jobs for those time periods fail. This leads to an increase in the segment count for the correlated_flow and correlated_flow_viz tables, and a large segment count for correlated_flow_viz degrades query performance.
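
To confirm that the segment count is growing, the per-datasource segment totals can be read from the Druid Coordinator. This assumes the Coordinator listens on its default port 8081; the response includes a segment count and total size for the datasource:

    curl http://localhost:8081/druid/coordinator/v1/datasources/correlated_flow
    curl http://localhost:8081/druid/coordinator/v1/datasources/correlated_flow_viz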

Another cause of the increasing segment count is the segment rollup job not being scheduled properly. This leaves all segments at an hourly interval, and their number grows over time.
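
A quick way to see whether segments are still at an hourly interval is to list the segment intervals for the datasource from the Coordinator (again assuming the default port 8081). Hourly segments appear as one-hour intervals, whereas rolled-up segments cover a full day or week:

    curl http://localhost:8081/druid/coordinator/v1/datasources/correlated_flow_viz/intervals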

Resolution

This is a known issue impacting NSX Intelligence 1.2.0.

Currently, there is no resolution.

Workaround:
Change the context compaction job frequency to run weekly so that its interval does not overlap with currently running ingestion jobs.
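
The exact file and command that drive the context compaction schedule depend on the appliance build; the change itself only affects the cron schedule fields. As an illustrative sketch (the path below is a placeholder, not the actual compaction command), a daily entry becomes a weekly one by fixing the day-of-week field:

    # placeholder example only: run weekly on Sunday at 01:00 instead of daily at 01:00
    # 0 1 * * * upace /opt/vmware/pace/druid-config/<context-compaction-command>
    0 1 * * 0 upace /opt/vmware/pace/druid-config/<context-compaction-command>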

VMware also recommends proactively scheduling the rollup tasks as follows:
  1. Open the rollup task cron file:

    vim /etc/cron.d/pace_rollup_task
     
  2. Add new lines to the end of the file similar to:

    MAILTO=""
    0 2 * * * upace /opt/vmware/pace/druid-config/reindex-cf.py -f /opt/vmware/pace/druid-config/reindex-cf-template.json -d correlated_flow -g "day" -n 1 --guard-interval-days 1 -s /opt/vmware/pace/druid-config/rollup_success_correlated_flow
    0 3 * * * upace /opt/vmware/pace/druid-config/reindex-cf.py -f /opt/vmware/pace/druid-config/reindex-cf-template.json -d correlated_flow_viz -g "day" -n 1 --guard-interval-days 1 -s /opt/vmware/pace/druid-config/rollup_success_correlated_flow_viz
    0 * * * * upace /opt/vmware/pace/druid-config/reindex-cf.py -f /opt/vmware/pace/druid-config/reindex-cf-template.json -d correlated_flow -g "hour" -n 1
    0 5 * * 0 upace /opt/vmware/pace/druid-config/reindex-cf.py -f /opt/vmware/pace/druid-config/reindex-cf-template.json -d correlated_flow -g "week" -n 7 --guard-interval-days 1 -s /opt/vmware/pace/druid-config/rollup_success_correlated_flow_weekly
    0 6 * * 0 upace /opt/vmware/pace/druid-config/reindex-cf.py -f /opt/vmware/pace/druid-config/reindex-cf-template.json -d correlated_flow_viz -g "week" -n 7 --guard-interval-days 7 -s /opt/vmware/pace/druid-config/rollup_success_correlated_flow_viz_weekly


     
  3. Restart the cron service with this command:

    /etc/init.d/cron restart

Note: Wait up to 24 hours for the rollup job to be scheduled. A way to verify that the rollup tasks were submitted is shown below.
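
Once the cron entries have run, the reindex tasks they submit should appear in the Druid task lists. One way to check, again assuming the Overlord/indexer endpoint http://localhost:8090 used elsewhere in this article, is:

    curl http://localhost:8090/druid/indexer/v1/runningTasks
    curl http://localhost:8090/druid/indexer/v1/completeTasks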

To mitigate the issue caused by late or early timestamps, run these commands:
  1. Navigate to the druid config folder:

    cd /opt/vmware/pace/druid-config
     
  2. Open the druid kafka ingestion file for correlated flow viz:

    vim correlated_flow_visualization_kafka_supervisor.json
     
  3. Search for the string "ioConfig":

    /ioConfig
     
  4. Add two new items, "lateMessageRejectionPeriod":"PT30M" and "earlyMessageRejectionPeriod":"PT30M", inside the ioConfig brackets.

    For example:

    "ioConfig": {"topic": "correlated_flow", "lateMessageRejectionPeriod":"PT30M", "earlyMessageRejectionPeriod":"PT30M", "replicas": 1, "taskCount": 1, "taskDuration": "PT10M",
     
  5. Save and Exit:

    ESC then :wq!
     
  6. POST the updated JSON to Druid:

    curl -XPOST -H'Content-Type: application/json' -d @correlated_flow_visualization_kafka_supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
     
  7. Open the druid kafka ingestion file for correlated flow:

    vim correlated_flow_kafka_supervisor.json
     
  8. Search for the string "ioConfig":

    /ioConfig
     
  9. Add two new items, "lateMessageRejectionPeriod":"PT30M" and "earlyMessageRejectionPeriod":"PT30M", inside the ioConfig brackets. For example:

    "ioConfig": {"topic": "correlated_flow", "lateMessageRejectionPeriod":"PT30M", "earlyMessageRejectionPeriod":"PT30M", "replicas": 1, "taskCount": 1, "taskDuration": "PT10M",
     
  10. Save and Exit:

    ESC, then :wq!
     
  11. POST the updated JSON to Druid:

    curl -XPOST -H'Content-Type: application/json' -d @correlated_flow_kafka_supervisor.json http://localhost:8090/druid/indexer/v1/supervisor
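
After both supervisor specs have been posted, the supervisors can be listed and their status checked to confirm they restarted with the updated ioConfig. The supervisor IDs are assumed here to match the datasource names; adjust them if they differ in your deployment:

    curl http://localhost:8090/druid/indexer/v1/supervisor
    curl http://localhost:8090/druid/indexer/v1/supervisor/correlated_flow/status
    curl http://localhost:8090/druid/indexer/v1/supervisor/correlated_flow_viz/status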


Additional Information

Impact/Risks:
The increased number of segments impacts Visualization API query performance; as a result, loading the group view for 12-hour or 24-hour time frames returns with a suggestion to use filters.

Recommendation jobs may time out or fail due to the degraded query performance.