1. The nsx-config pods crash because their init container wait-for-druid-supervisor-ready crashes. As root on SSPI via SSH, list the affected pods (a check of the init container status is shown after the output below):
k -n nsxi-platform get pods | grep nsx-config
nsxi-platform nsx-config-0-0 0/2 Init:CrashLoopBackOff 126 (21s ago) 19h 172.20.xx.xx mr027860-vsx-md-0-8m5r7-wp99t <none> <none>
nsxi-platform nsx-config-1-0 0/2 Init:CrashLoopBackOff 126 (87s ago) 19h 172.20.xx.xx mr027860-vsx-md-0-8m5r7-wp99t <none> <none>
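To confirm that the failing container is the wait-for-druid-supervisor-ready init container, you can print the init container statuses of one of the pods. A minimal sketch; the pod name nsx-config-0-0 is taken from the example output above:
# Print each init container name and its current state (optional check)
k -n nsxi-platform get pod nsx-config-0-0 -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'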
2. Check the logs of the wait-for-druid-supervisor-ready container to get the unhealthy supervisor name. The possible unhealthy supervisors are pace2druid_policy_intent_config and pace2druid_manager_realization_config. In this example, the unhealthy supervisor is pace2druid_policy_intent_config (a grep shortcut across both pods is shown after the log output below).
k -n nsxi-platform logs nsx-config-0-0 -c wait-for-druid-supervisor-ready
INFO:root:==============Checking the pace2druid_policy_intent_config status=============
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 10.96.xx.xx:8281
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.96.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://10.96.xx.xx:8281 "GET /druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status HTTP/1.1" 200 None
INFO:root:{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-13T16:08:58.335Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":0,"replicas":1,"durationSeconds":600,"activeTasks":[],"publishingTasks":[],"minimumLag":{},"aggregateLag":0,"suspended":false,"healthy":false,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[{"timestamp":"2025-02-13T15:42:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T15:52:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T16:02:58.041Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false}]}}
INFO:root:Supervisor: pace2druid_policy_intent_config status: UNHEALTHY_SUPERVISOR
INFO:root:Supervisor pace2druid_policy_intent_config is not ready
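If you only need the supervisor name, you can grep the init-container logs of both nsx-config pods for the "is not ready" line. A convenience sketch based on the log format above, using the pod names from step 1:
# Print the "Supervisor <name> is not ready" line from each nsx-config pod
for pod in nsx-config-0-0 nsx-config-1-0; do
  k -n nsxi-platform logs "$pod" -c wait-for-druid-supervisor-ready | grep "is not ready"
done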
3. Check the druid-coordinator pod logs to confirm that the supervisor pace2druid_policy_intent_config is unhealthy (a filtered variant of this check is shown after the stack trace below).
k -n nsxi-platform get pods | grep druid-coordinator
druid-coordinator-756dcf57b4-n22bz 1/1 Running 0 4d10h
k -n nsxi-platform logs druid-coordinator-756dcf57b4-n22bz
2025-02-13T16:24:51,237 WARN [KafkaSupervisor-pace2druid_policy_intent_config-Reporting-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Exception while getting current/latest sequences
org.apache.druid.indexing.seekablestream.common.StreamException: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:374) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:380) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.seekToLatest(KafkaRecordSupplier.java:138) ~[?:?]
at org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor.updatePartitionLagFromStream(KafkaSupervisor.java:412) ~[?:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.updateCurrentAndLatestOffsets(SeekableStreamSupervisor.java:4049) ~[druid-indexing-service-31.0.0.jar:31.0.0]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:369) ~[?:?]
at org.apache.kafka.clients.consumer.internals.SubscriptionState.lambda$requestOffsetReset$3(SubscriptionState.java:647) ~[?:?]
at java.base/java.util.ArrayList.forEach(Unknown Source) ~[?:?]
at org.apache.kafka.clients.consumer.internals.SubscriptionState.requestOffsetReset(SubscriptionState.java:645) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1652) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$seekToLatest$7(KafkaRecordSupplier.java:138) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$wrapExceptions$15(KafkaRecordSupplier.java:381) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:371) ~[?:?]
... 10 more
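The full coordinator log can be long; filtering on the supervisor name surfaces the relevant "No current assignment for partition" and "Listener ... already registered" entries. A sketch, using the coordinator pod name from the example above:
# Show only log lines that mention the unhealthy supervisor
k -n nsxi-platform logs druid-coordinator-756dcf57b4-n22bz | grep pace2druid_policy_intent_config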
4. Check the supervisor status from the druid-broker pod:
k -n nsxi-platform get pods | grep druid-broker
druid-broker-678c997b89-j4lkc 1/1 Running 1 (4d19h ago) 9d
k -n nsxi-platform exec -it <druid-broker-pod-name from above command output> -- curl "https://druid-router:8280/druid/indexer/v1/supervisor/<unhealthy-supervisor-name>/status" -k
Example response:
{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-19T21:24:27.486Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":1,"replicas":1,"durationSeconds":600,"activeTasks":[{"id":"index_kafka_pace2druid_policy_intent_config_da0cf7cde99f4fb_elgecccp","startingOffsets":{"0":18},"startTime":"2025-02-19T21:15:50.983Z","remainingSeconds":83,"type":"ACTIVE","currentOffsets":{"0":18},"lag":{"0":0}}],"publishingTasks":[],"latestOffsets":{"0":18},"minimumLag":{"0":0},"aggregateLag":0,"offsetsLastUpdated":"2025-02-19T21:23:59.556Z","suspended":false,"healthy":true,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[]}}
5. Check the state and detailedState fields in the response (a grep that extracts just these two fields is shown below). If both values are "UNHEALTHY_SUPERVISOR", follow the Resolution section to reset the unhealthy supervisor.
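Because the response is a single line of JSON, you can extract just these two fields by piping the same request through grep on the SSP node. A sketch; the broker pod and supervisor name are taken from this example:
# Print only the state and detailedState fields from the supervisor status
k -n nsxi-platform exec druid-broker-678c997b89-j4lkc -- curl -s -k "https://druid-router:8280/druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status" | grep -oE '"(state|detailedState)":"[^"]*"'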
Environment
SSP 5.0.0
Cause
When a terminate-all-supervisors command is issued, Druid can fail to unregister some of the supervisors, most likely because the druid-coordinator pod is not fully up at the beginning of deployment. Druid then fails to register the new supervisor because the old one was never unregistered, which results in the "Listener ... already registered" errors seen in step 2.
Resolution
This is a known issue affecting SSP deployments. Open a support ticket with Broadcom to resolve the issue.