1. The nsx-config pods are stuck in Init:CrashLoopBackOff because their init container, wait-for-druid-supervisor-ready, keeps failing.
root on SSPI via ssh:
k -n nsxi-platform get pods | grep nsx-config
nsxi-platform   nsx-config-0-0   0/2   Init:CrashLoopBackOff   126 (21s ago)   19h   172.20.xx.xx   mr027860-vsx-md-0-8m5r7-wp99t   <none>   <none>
nsxi-platform   nsx-config-1-0   0/2   Init:CrashLoopBackOff   126 (87s ago)   19h   172.20.xx.xx   mr027860-vsx-md-0-8m5r7-wp99t   <none>   <none>
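The same check can also be scripted. The following is a minimal sketch using the kubernetes Python client (assuming the client library and a valid kubeconfig are available on the workstation; it is not part of the product) that flags nsx-config pods whose wait-for-druid-supervisor-ready init container is waiting in CrashLoopBackOff:

from kubernetes import client, config

# Minimal sketch, not shipped with SSP: requires the kubernetes Python client and a kubeconfig.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("nsxi-platform").items:
    if not pod.metadata.name.startswith("nsx-config"):
        continue
    for status in pod.status.init_container_statuses or []:
        if status.name == "wait-for-druid-supervisor-ready" and status.state.waiting:
            # Prints e.g. "nsx-config-0-0 CrashLoopBackOff"
            print(pod.metadata.name, status.state.waiting.reason)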
2. Check the logs of the wait-for-druid-supervisor-ready init container to get the name of the unhealthy supervisor. The possible unhealthy supervisors are pace2druid_policy_intent_config and pace2druid_manager_realization_config. In this example, the unhealthy supervisor is pace2druid_policy_intent_config.
k -n nsxi-platform logs nsx-config-0-0 -c wait-for-druid-supervisor-ready
INFO:root:==============Checking the pace2druid_policy_intent_config status=============
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 10.96.xx.xx:8281
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.96.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  warnings.warn(
DEBUG:urllib3.connectionpool:https://10.96.xx.xx:8281 "GET /druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status HTTP/1.1" 200 None
INFO:root:{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-13T16:08:58.335Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":0,"replicas":1,"durationSeconds":600,"activeTasks":[],"publishingTasks":[],"minimumLag":{},"aggregateLag":0,"suspended":false,"healthy":false,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[{"timestamp":"2025-02-13T15:42:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T15:52:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T16:02:58.041Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false}]}}
INFO:root:Supervisor: pace2druid_policy_intent_config status: UNHEALTHY_SUPERVISOR
INFO:root:Supervisor pace2druid_policy_intent_config is not ready
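The init container is essentially polling the Druid supervisor status API shown in the log above. A minimal sketch of that poll is given below; the address, port, and supervisor name are placeholders copied from this example, not authoritative values from the real script:

import json
import urllib3

# Placeholder URL for illustration; the real init container obtains the router address from its environment.
DRUID_STATUS_URL = "https://10.96.xx.xx:8281/druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status"

# Certificate verification is disabled here only to mirror the InsecureRequestWarning in the log above.
http = urllib3.PoolManager(cert_reqs="CERT_NONE")
resp = http.request("GET", DRUID_STATUS_URL)
payload = json.loads(resp.data)["payload"]

print("Supervisor:", payload["dataSource"], "status:", payload["state"])
if not payload["healthy"] or payload["state"] == "UNHEALTHY_SUPERVISOR":
    # When the supervisor is not ready the real check fails, which is what produces the CrashLoopBackOff in step 1.
    print("Supervisor", payload["dataSource"], "is not ready")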
3. Check the druid-coordinator pod logs to confirm that the supervisor pace2druid_policy_intent_config is unhealthy.
k -n nsxi-platform get pods | grep druid-coordinator
druid-coordinator-756dcf57b4-n22bz 1/1 Running 0 4d10h
k -n nsxi-platform logs druid-coordinator-756dcf57b4-n22bz
2025-02-13T16:24:51,237 WARN [KafkaSupervisor-pace2druid_policy_intent_config-Reporting-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Exception while getting current/latest sequences
org.apache.druid.indexing.seekablestream.common.StreamException: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:374) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:380) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.seekToLatest(KafkaRecordSupplier.java:138) ~[?:?]
	at org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor.updatePartitionLagFromStream(KafkaSupervisor.java:412) ~[?:?]
	at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.updateCurrentAndLatestOffsets(SeekableStreamSupervisor.java:4049) ~[druid-indexing-service-31.0.0.jar:31.0.0]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
	at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:369) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.SubscriptionState.lambda$requestOffsetReset$3(SubscriptionState.java:647) ~[?:?]
	at java.base/java.util.ArrayList.forEach(Unknown Source) ~[?:?]
	at org.apache.kafka.clients.consumer.internals.SubscriptionState.requestOffsetReset(SubscriptionState.java:645) ~[?:?]
	at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1652) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$seekToLatest$7(KafkaRecordSupplier.java:138) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$wrapExceptions$15(KafkaRecordSupplier.java:381) ~[?:?]
	at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:371) ~[?:?]
	... 10 more
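Coordinator logs can be long; a small filter helps isolate the entries for the unhealthy supervisor. The sketch below is only a convenience wrapper around kubectl (the supervisor and pod names are examples to be substituted), not part of the product:

import subprocess

SUPERVISOR = "pace2druid_policy_intent_config"   # substitute the supervisor name found in step 2
POD = "druid-coordinator-756dcf57b4-n22bz"       # substitute the pod name from the get pods output

log = subprocess.run(
    ["kubectl", "-n", "nsxi-platform", "logs", POD],
    capture_output=True, text=True, check=True,
).stdout

# Keep only WARN/ERROR/exception lines that reference the unhealthy supervisor.
for line in log.splitlines():
    if SUPERVISOR in line and any(tag in line for tag in ("WARN", "ERROR", "Exception")):
        print(line)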
4. Check the supervisor status from the druid-broker pod:
k -n nsxi-platform get pods | grep druid-broker
druid-broker-678c997b89-j4lkc 1/1 Running 1 (4d19h ago) 9d

k -n nsxi-platform exec -it <druid-broker-pod-name from above command output> -- curl "https://druid-router:8280/druid/indexer/v1/supervisor/<unhealthy-supervisor-name>/status" -k

Example response:
{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-19T21:24:27.486Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":1,"replicas":1,"durationSeconds":600,"activeTasks":[{"id":"index_kafka_pace2druid_policy_intent_config_da0cf7cde99f4fb_elgecccp","startingOffsets":{"0":18},"startTime":"2025-02-19T21:15:50.983Z","remainingSeconds":83,"type":"ACTIVE","currentOffsets":{"0":18},"lag":{"0":0}}],"publishingTasks":[],"latestOffsets":{"0":18},"minimumLag":{"0":0},"aggregateLag":0,"offsetsLastUpdated":"2025-02-19T21:23:59.556Z","suspended":false,"healthy":true,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[]}}
5. Check the state and detailedState fields in the response. If both values are "UNHEALTHY_SUPERVISOR", follow the Resolution section to reset the unhealthy supervisor.
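This check can also be done programmatically by piping the curl output from step 4 into a short helper script, for example a hypothetical check_state.py such as the sketch below:

import json
import sys

# Usage (hypothetical helper, not shipped with SSP):
#   k -n nsxi-platform exec <druid-broker-pod-name> -- \
#     curl -sk "https://druid-router:8280/druid/indexer/v1/supervisor/<unhealthy-supervisor-name>/status" \
#     | python3 check_state.py
payload = json.load(sys.stdin)["payload"]
state = payload["state"]
detailed_state = payload["detailedState"]

if state == "UNHEALTHY_SUPERVISOR" and detailed_state == "UNHEALTHY_SUPERVISOR":
    print(payload["dataSource"], "is unhealthy - follow the Resolution section")
else:
    print(payload["dataSource"], "state:", state, "detailedState:", detailed_state)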
SSP 5.0.0
When a terminate-all-supervisors command is issued, Druid can fail to unregister some of the supervisors, most likely because the druid-coordinator pod is not fully up at the beginning of the deployment. Druid then fails to register the new supervisor because the old one was never unregistered, which leaves the supervisor in the UNHEALTHY_SUPERVISOR state.
This is a known issue affecting SSP deployments. Open a support ticket with Broadcom to resolve the issue.