Starting a recommendation fails with an error message that indicates a Druid exception. The visualization page may show the same error message.
Symptom:
1. The user cannot start a recommendation; the error message reports a Druid exception.
2. Druid pods or ZooKeeper pods may have restarted.
3. The Druid router pod log reports that it failed to find the Druid broker.
4. Get the Druid router pod name with the command 'napp-k get pod | grep druid-router'. Then run the command 'napp-k exec -it <druid-router name> bash -- curl https://druid-router:8280/druid/router/v1/brokers -k'. The command returns an empty broker list such as '{"druid/broker":[]}'.
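The check in step 4 can be scripted. The following is a minimal sketch, not part of NAPP or Druid: brokers_registered is a hypothetical helper name, and it assumes the response has the same shape as the examples in this article.

```shell
# Sketch: classify the JSON returned by the router's
# /druid/router/v1/brokers endpoint.
# brokers_registered is a hypothetical helper name (an assumption).
# Return codes: 0 = at least one broker, 1 = empty list, 2 = unexpected response.
brokers_registered() {
    # Strip whitespace (including any carriage returns added by a TTY)
    # so that '[ ]' and '[]' are treated the same.
    compact=$(printf '%s' "$1" | tr -d '[:space:]')
    case "$compact" in
        *'"druid/broker":[]'*) return 1 ;;
        *'"druid/broker":["'*) return 0 ;;
        *) return 2 ;;
    esac
}

# Example usage on the NSX Manager, with the command from step 4:
#   out=$(napp-k exec -it <druid-router name> bash -- curl https://druid-router:8280/druid/router/v1/brokers -k)
#   brokers_registered "$out" || echo "router has no brokers registered"
```

A return code of 1 corresponds to the symptom described here. A return code of 2 means the response did not look like a broker list and should be inspected manually.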
Log:
Run the following command on the NSX Manager as root:
(1) Get the Druid router log: 'napp-k logs <druid-router name>'
You should see error logs similar to the following:
2024-10-08T12:01:00,837 ERROR [qtp355366659-149] org.apache.druid.server.router.QueryHostFinder - No server found for serviceName[druid/broker]. Using backup
2024-10-08T12:01:00,837 ERROR [qtp355366659-149] org.apache.druid.server.router.QueryHostFinder - No backup found for serviceName[druid/broker]. Using default[druid/broker]
2024-10-08T12:01:00,837 ERROR [qtp355366659-149] org.apache.druid.server.router.QueryHostFinder - Catastrophic failure! No brokers found at all! Failing request!: {class=org.apache.druid.server.router.QueryHostFinder}
2024-10-08T12:01:00,837 WARN [qtp355366659-149] org.apache.druid.server.AsyncQueryForwardingServlet - Unexpected exception occurs
org.apache.druid.query.QueryInterruptedException: There are no available brokers for query[GroupByQuery{dataSource='pace2druid_manager_realization_config', querySegmentSpec=LegacySegmentSpec{intervals=[2024-10-01T00:01:00.000Z/2024-10-08T12:00:38.000Z]}, virtualColumns=[ExpressionVirtualColumn{name='VC_CONCATsource_groups', expression='array_to_string(source_groups,'@@')', outputType=STRING}, ExpressionVirtualColumn{name='VC_CONCATdestination_groups', expression='array_to_string(destination_groups,'@@')', outputType=STRING}, ExpressionVirtualColumn{name='VC_CONCATservices_array', expression='array_to_string(services_array,'@@')', outputType=STRING}], limitSpec=NoopLimitSpec, dimFilter=(rule_id IN (2, 4) && site_id = ecdd91ff-c84c-4dce-9779-1468bde44730 && config_type = MANAGER_DFW_RULE), granularity=AllGranularity, dimensions=[DefaultDimensionSpec{dimension='rule_id', outputName='rule_id', outputType='STRING'}], aggregatorSpecs=[StringLastAggregatorFactory{fieldName='VC_CONCATsource_groups', name='VC_CONCATsource_groups', maxStringBytes=1024, timeColumn=__time}, StringLastAggregatorFactory{fieldName='VC_CONCATdestination_groups', name='VC_CONCATdestination_groups', maxStringBytes=1024, timeColumn=__time}, StringLastAggregatorFactory{fieldName='VC_CONCATservices_array', name='VC_CONCATservices_array', maxStringBytes=1024, timeColumn=__time}, LongLastAggregatorFactory{name='lastUpdateTime', fieldName='__time', timeColumn='__time'}, LongLastAggregatorFactory{name='latest_last_modified_time', fieldName='latest_last_modified_time', timeColumn='__time'}, LongLastAggregatorFactory{name='deleted', fieldName='deleted', timeColumn='__time'}, LongLastAggregatorFactory{name='latest_revision', fieldName='latest_revision', timeColumn='__time'}], postAggregatorSpecs=[ExpressionPostAggregator{name='source_groups', expression='string_to_array(VC_CONCATsource_groups,'@@')', ordering=null, outputType=null}, ExpressionPostAggregator{name='destination_groups', 
expression='string_to_array(VC_CONCATdestination_groups,'@@')', ordering=null, outputType=null}, ExpressionPostAggregator{name='services_array', expression='string_to_array(VC_CONCATservices_array,'@@')', ordering=null, outputType=null}], havingSpec=null, context={queryId=PROCESSING-RAWFLOW-3-9d5f2cf2-7917-4a6f-aa63-27cc0b633564}}].Please check that your brokers are running and healthy.
at org.apache.druid.query.QueryInterruptedException.wrapIfNeeded(QueryInterruptedException.java:113) ~[druid-processing-29.0.1.jar:29.0.1]
at org.apache.druid.server.AsyncQueryForwardingServlet.handleException(AsyncQueryForwardingServlet.java:117) ~[druid-services-29.0.1.jar:29.0.1]
at org.apache.druid.server.AsyncQueryForwardingServlet.service(AsyncQueryForwardingServlet.java:271) ~[druid-services-29.0.1.jar:29.0.1]
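To spot these entries in a long router log, the output can be filtered. The following is a minimal sketch: count_broker_errors is a hypothetical helper name, and the patterns are copied from the sample log above, so they may need adjusting if the message text differs in another Druid version.

```shell
# Sketch: count router log lines reporting that no broker could be found.
# count_broker_errors is a hypothetical helper name (an assumption).
# It reads log text on stdin and prints the number of matching lines.
count_broker_errors() {
    # grep -c exits non-zero when the count is 0; '|| true' keeps the
    # helper usable in scripts that run with 'set -e'.
    grep -c -E 'QueryHostFinder - (No server found for serviceName\[druid/broker\]|Catastrophic failure! No brokers found at all!)' || true
}

# Example usage on the NSX Manager:
#   napp-k logs <druid-router name> | count_broker_errors
```

A non-zero count suggests that the router could not reach any broker at the time those lines were logged.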
Environment:
NAPP 4.2.0
NSX 4.1.2
Kubernetes tool version-v.1.23.8+vmware.3
Cause:
The Druid router cannot find the Druid broker through ZooKeeper.
Resolution:
There is currently no fix.
Workaround:
1. Get the Druid pod names with 'napp-k get pod | grep druid', then delete every pod whose name starts with druid-router, druid-broker, druid-coordinator, druid-historical, or druid-middle-manager using the command 'napp-k delete pod <pod name>'.
2. Get the Druid router pod name with the command 'napp-k get pod | grep druid-router'. Then run the command 'napp-k exec -it <druid-router name> bash -- curl https://druid-router:8280/druid/router/v1/brokers -k'. The command should now return a non-empty broker list such as '{"druid/broker":["xyx.yxy.y.yx:8282"]}'.
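Step 1 can be sketched as a small filter that picks the affected pods out of the pod listing. This is an illustration only: select_druid_pods is a hypothetical helper name, and it assumes the pod name is the first column of the 'napp-k get pod' output.

```shell
# Sketch: select the Druid pods named in step 1 from 'napp-k get pod' output.
# select_druid_pods is a hypothetical helper name (an assumption).
# It reads the pod listing on stdin and prints one pod name per line.
select_druid_pods() {
    awk '$1 ~ /^druid-(router|broker|coordinator|historical|middle-manager)/ { print $1 }'
}

# Example usage on the NSX Manager:
#   napp-k get pod | select_druid_pods | while read -r pod; do
#       napp-k delete pod "$pod"
#   done
```

Review the selected names before deleting anything, then repeat the broker check from step 2 once the pods are running again.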