1. The nsx-config pods crash because their init container wait-for-druid-supervisor-ready crashes. As root on SSPI via SSH, list the affected pods (a check of the init container status is shown after the output below):
k -n nsxi-platform get pods | grep nsx-config
nsxi-platform nsx-config-0-0 0/2 Init:CrashLoopBackOff 126 (21s ago) 19h 172.20.xx.xx mr027860-vsx-md-0-8m5r7-wp99t <none> <none>
nsxi-platform nsx-config-1-0 0/2 Init:CrashLoopBackOff 126 (87s ago) 19h 172.20.xx.xx mr027860-vsx-md-0-8m5r7-wp99t <none> <none>
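To confirm that the failing container is the wait-for-druid-supervisor-ready init container, you can print the init container statuses of one of the pods. A minimal sketch; the pod name nsx-config-0-0 is taken from the example output above:
# Print each init container name and its current state (optional check)
k -n nsxi-platform get pod nsx-config-0-0 -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'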
2. Check the logs of the wait-for-druid-supervisor-ready container to get the unhealthy supervisor name. The possible unhealthy supervisors are pace2druid_policy_intent_config and pace2druid_manager_realization_config. In this example, the unhealthy supervisor is pace2druid_policy_intent_config (a grep shortcut across both pods is shown after the log output below).
k -n nsxi-platform logs nsx-config-0-0 -c wait-for-druid-supervisor-ready
INFO:root:==============Checking the pace2druid_policy_intent_config status=============
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): 10.96.xx.xx:8281
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:1020: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.96.xx.xx'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn(
DEBUG:urllib3.connectionpool:https://10.96.xx.xx:8281 "GET /druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status HTTP/1.1" 200 None
INFO:root:{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-13T16:08:58.335Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":0,"replicas":1,"durationSeconds":600,"activeTasks":[],"publishingTasks":[],"minimumLag":{},"aggregateLag":0,"suspended":false,"healthy":false,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[{"timestamp":"2025-02-13T15:42:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T15:52:58.040Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false},{"timestamp":"2025-02-13T16:02:58.041Z","exceptionClass":"org.apache.druid.java.util.common.ISE","message":"Listener [KafkaSupervisor-pace2druid_policy_intent_config] already registered","streamException":false}]}}
INFO:root:Supervisor: pace2druid_policy_intent_config status: UNHEALTHY_SUPERVISOR
INFO:root:Supervisor pace2druid_policy_intent_config is not ready
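If you only need the supervisor name, you can grep the init-container logs of both nsx-config pods for the "is not ready" line. A convenience sketch based on the log format above, using the pod names from step 1:
# Print the "Supervisor <name> is not ready" line from each nsx-config pod
for pod in nsx-config-0-0 nsx-config-1-0; do
  k -n nsxi-platform logs "$pod" -c wait-for-druid-supervisor-ready | grep "is not ready"
done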
3. Check the druid-coordinator pod logs to confirm that the supervisor pace2druid_policy_intent_config is unhealthy (a filtered variant of this check is shown after the stack trace below).
k -n nsxi-platform get pods | grep druid-coordinator
druid-coordinator-756dcf57b4-n22bz 1/1 Running 0 4d10h
k -n nsxi-platform logs druid-coordinator-756dcf57b4-n22bz
2025-02-13T16:24:51,237 WARN [KafkaSupervisor-pace2druid_policy_intent_config-Reporting-0] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Exception while getting current/latest sequences
org.apache.druid.indexing.seekablestream.common.StreamException: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:374) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:380) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.seekToLatest(KafkaRecordSupplier.java:138) ~[?:?]
at org.apache.druid.indexing.kafka.supervisor.KafkaSupervisor.updatePartitionLagFromStream(KafkaSupervisor.java:412) ~[?:?]
at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.updateCurrentAndLatestOffsets(SeekableStreamSupervisor.java:4049) ~[druid-indexing-service-31.0.0.jar:31.0.0]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.IllegalStateException: No current assignment for partition pace2druid_policy_intent_config-0
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:369) ~[?:?]
at org.apache.kafka.clients.consumer.internals.SubscriptionState.lambda$requestOffsetReset$3(SubscriptionState.java:647) ~[?:?]
at java.base/java.util.ArrayList.forEach(Unknown Source) ~[?:?]
at org.apache.kafka.clients.consumer.internals.SubscriptionState.requestOffsetReset(SubscriptionState.java:645) ~[?:?]
at org.apache.kafka.clients.consumer.KafkaConsumer.seekToEnd(KafkaConsumer.java:1652) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$seekToLatest$7(KafkaRecordSupplier.java:138) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.lambda$wrapExceptions$15(KafkaRecordSupplier.java:381) ~[?:?]
at org.apache.druid.indexing.kafka.KafkaRecordSupplier.wrapExceptions(KafkaRecordSupplier.java:371) ~[?:?]
... 10 more
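The full coordinator log can be long; filtering on the supervisor name surfaces the relevant "No current assignment for partition" and "Listener ... already registered" entries. A sketch, using the coordinator pod name from the example above:
# Show only log lines that mention the unhealthy supervisor
k -n nsxi-platform logs druid-coordinator-756dcf57b4-n22bz | grep pace2druid_policy_intent_config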
4. Check the supervisor status from the druid-broker pod:
k -n nsxi-platform get pods | grep druid-broker
druid-broker-678c997b89-j4lkc 1/1 Running 1 (4d19h ago) 9d
k -n nsxi-platform exec -it <druid-broker-pod-name from above command output> -- curl "https://druid-router:8280/druid/indexer/v1/supervisor/<unhealthy-supervisor-name>/status" -k
Example response:
{"id":"pace2druid_policy_intent_config","generationTime":"2025-02-19T21:24:27.486Z","payload":{"dataSource":"pace2druid_policy_intent_config","stream":"pace2druid_policy_intent_config","partitions":1,"replicas":1,"durationSeconds":600,"activeTasks":[{"id":"index_kafka_pace2druid_policy_intent_config_da0cf7cde99f4fb_elgecccp","startingOffsets":{"0":18},"startTime":"2025-02-19T21:15:50.983Z","remainingSeconds":83,"type":"ACTIVE","currentOffsets":{"0":18},"lag":{"0":0}}],"publishingTasks":[],"latestOffsets":{"0":18},"minimumLag":{"0":0},"aggregateLag":0,"offsetsLastUpdated":"2025-02-19T21:23:59.556Z","suspended":false,"healthy":true,"state":"UNHEALTHY_SUPERVISOR","detailedState":"UNHEALTHY_SUPERVISOR","recentErrors":[]}}
5. Check the state and detailedState fields in the response (a grep that extracts just these two fields is shown below). If both values are "UNHEALTHY_SUPERVISOR", follow the Resolution section to reset the unhealthy supervisor.
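Because the response is a single line of JSON, you can extract just these two fields by piping the same request through grep on the SSP node. A sketch; the broker pod and supervisor name are taken from this example:
# Print only the state and detailedState fields from the supervisor status
k -n nsxi-platform exec druid-broker-678c997b89-j4lkc -- curl -s -k "https://druid-router:8280/druid/indexer/v1/supervisor/pace2druid_policy_intent_config/status" | grep -oE '"(state|detailedState)":"[^"]*"'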
Environment
SSP 5.0.0
Cause
When a terminate-all-supervisors command is issued, Druid can fail to unregister some of the supervisors, most likely because the druid-coordinator pod is not fully up at the beginning of deployment. Druid then fails to register the new supervisor because the old one was never unregistered, which results in the "Listener ... already registered" errors seen in step 2.
Resolution
This is a known issue affecting SSP deployments. Open a support ticket with Broadcom to resolve the issue.