Druid Configuration Broker continuously restarts when there are many groups and/or virtual machines

Article ID: 319821

Products

VMware vDefend Firewall

VMware vDefend Firewall with Advanced Threat Prevention

Issue/Introduction

Symptoms:
Recommendations cannot be run, the Continuous Monitoring job sometimes goes into an Error state, and the Visualization Canvas and filters either do not work or are very slow.

Log:
1) The output of napp-k get pods shows the druid-config-broker pod with many restarts, for example:

druid-config-broker-755f689fdf-n99d7 1/1 Running 220 (4m26s ago) 4d11h
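On a platform with many pods, the restart column can be filtered rather than read by eye. The following is a minimal sketch, assuming the usual column order of get pods output (NAME, READY, STATUS, RESTARTS, AGE). The function name high_restarts and the default threshold of 50 are illustrative assumptions, not part of any product tooling.

```shell
# Print the name and restart count of pods whose RESTARTS column (4th field)
# exceeds a threshold. Reads "get pods" style output on stdin.
# The default threshold of 50 is an arbitrary assumption.
high_restarts() {
  threshold="${1:-50}"
  awk -v t="$threshold" '$4 ~ /^[0-9]+$/ && $4 + 0 > t + 0 { print $1, $4 }'
}

# Example using the line shown above.
# On an NSX Manager this would be: napp-k get pods | high_restarts
printf '%s\n' 'druid-config-broker-755f689fdf-n99d7 1/1 Running 220 (4m26s ago) 4d11h' | high_restarts
```

With the sample line, this prints the pod name followed by 220. Header lines and pods at or below the threshold are skipped.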

2) The pod's liveness and readiness probes fail:

root@systest-runner:~[1944]# napp-k get events --field-selector involvedObject.name=druid-config-broker-755f689fdf-n99d7
LAST SEEN TYPE REASON OBJECT MESSAGE
5m20s Warning Unhealthy pod/druid-config-broker-755f689fdf-n99d7 Liveness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
45m Warning Unhealthy pod/druid-config-broker-755f689fdf-n99d7 Readiness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
20s Warning Unhealthy pod/druid-config-broker-755f689fdf-n99d7 Readiness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Environment

VMware NSX-T Data Center

Cause

The druid-config-broker pod, part of the Druid database, was overloaded by the scale of data and kept restarting.

Resolution

Druid Config Brokers are removed in 420. The next release will no longer have Druid Broker pods.

Workaround:
The workaround is to add one more Druid Config Broker replica to handle the scale.

Run the following commands on the NSX Manager as root:
export KUBE_EDITOR=vim.tiny
napp-k edit deployment druid-config-broker

Find "spec->replicas", and increase the replica count by 1.
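The same change can also be sketched without opening an editor. This assumes that napp-k passes its arguments through to kubectl, so that the standard get -o jsonpath and scale subcommands are available; that pass-through is an assumption, not something this article confirms. The function name bump_replicas and the KCTL variable are hypothetical.

```shell
# Increase a deployment's replica count by one.
# KCTL names the kubectl-compatible command to call; defaulting to napp-k
# is an assumption based on the commands used in this article.
bump_replicas() {
  deploy="${1:-druid-config-broker}"
  kctl="${KCTL:-napp-k}"
  # Read the current replica count from the deployment spec.
  current=$($kctl get deployment "$deploy" -o jsonpath='{.spec.replicas}') || return 1
  case "$current" in
    ''|*[!0-9]*) echo "unexpected replica count: $current" >&2; return 1 ;;
  esac
  # Ask for one more replica than is currently configured.
  $kctl scale deployment "$deploy" --replicas=$((current + 1))
}
```

If napp-k is defined as a shell alias on your system, it is not expanded when called through the KCTL variable. In that case, run alias napp-k to see the underlying command and set KCTL to it before calling bump_replicas.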

If the issue persists after adding the replica, check the Druid Config Broker logs for lines similar to the following:

2024-03-13T20:08:33,410 WARN [main] org.apache.druid.discovery.DruidLeaderClient - Request[https://192.x.x.x:8281/druid/coordinator/v1/lookups/config/__default?detailed=true] received a 503 Service Unavailable response. Attempt 4/5
2024-03-13T20:08:33,416 WARN [main] org.apache.druid.discovery.DruidLeaderClient - Request[https://192.x.x.x:8281/druid/coordinator/v1/lookups/config/__default?detailed=true] received a 503 Service Unavailable response. Attempt 5/5

These lines mean that the old Druid coordinator pod was in an error state due to a SysMonitor issue, and the new Druid broker pod could not connect to it. Delete the Druid coordinator pod and retry; the issue should then be resolved.
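To check a long broker log for this pattern, the warnings can be counted instead of scanned by eye. This is a minimal sketch: the function name count_coordinator_503 is hypothetical, and the logs and delete pod subcommands are assumed to be available through napp-k in the same way as with kubectl.

```shell
# Count DruidLeaderClient warnings that report a 503 from the coordinator.
# Reads broker log text on stdin and prints the number of matching lines.
count_coordinator_503() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function's exit status at 0 while the count (0) is still printed.
  grep -c 'DruidLeaderClient.*503 Service Unavailable' || true
}

# Possible usage on an NSX Manager (pod names differ per deployment):
#   napp-k logs druid-config-broker-755f689fdf-n99d7 | count_coordinator_503
# If the count is non-zero, look up the coordinator pod name with
# napp-k get pods and delete that pod:
#   napp-k delete pod <druid-coordinator-pod-name>
```

The placeholder <druid-coordinator-pod-name> is left unfilled on purpose; take the actual name from the pod list in your environment.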

Additional Information

Impact/Risks:
On large deployments with many NSX groups and virtual machines, the Druid configuration broker pod (druid-config-broker-xxxx) sometimes keeps restarting. This has been observed when the scale reaches 5000 groups, 5000 VMs, 30 physical servers, and 270 hosts.
