Druid Configuration Broker continuously restarts when there are many groups and/or virtual machines
Article ID: 319821
Updated On:
Products
VMware vDefend Firewall
VMware vDefend Firewall with Advanced Threat Prevention
Issue/Introduction
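Symptoms: Recommendations cannot be run, the Continuous Monitoring job is sometimes in an Error state, and the Visualization Canvas and filters do not work or are very slow.
Log: napp-k get pods shows the druid-config-broker pod with many restarts, and the events for that pod show repeated liveness and readiness probe timeouts, as in the following output: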
root@systest-runner:~[1944]# napp-k get events --field-selector involvedObject.name=druid-config-broker-755f689fdf-n99d7
LAST SEEN   TYPE      REASON      OBJECT                                     MESSAGE
5m20s       Warning   Unhealthy   pod/druid-config-broker-755f689fdf-n99d7   Liveness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
45m         Warning   Unhealthy   pod/druid-config-broker-755f689fdf-n99d7   Readiness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
20s         Warning   Unhealthy   pod/druid-config-broker-755f689fdf-n99d7   Readiness probe failed: Get "https://192.x.x.x:8282/status/health": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
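To spot the restarting pod quickly, the pod listing can be filtered on its restart count. The helper below is a minimal sketch, not part of the product: the function name high_restarts and the default threshold of 5 are hypothetical, and it assumes the default "get pods" table layout in which the fourth column is RESTARTS.

```shell
# Hypothetical helper: print the name and restart count of every pod whose
# RESTARTS value exceeds a threshold (first argument, default 5).
# Reads "get pods" table output on stdin; assumes column 4 is RESTARTS.
high_restarts() {
  threshold="${1:-5}"
  # Skip the header row if present, then compare the numeric restart count.
  awk -v t="$threshold" '$1 != "NAME" && $4 + 0 > t { print $1, $4 }'
}

# Example use on the NSX Manager (not executed here):
#   napp-k get pods | grep druid-config-broker | high_restarts 5
```

Any pod name it prints can then be passed to the napp-k get events command shown above to confirm the probe failures.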
Environment
VMware NSX-T Data Center
Cause
The druid-config-broker database component was overloaded by the scale of data and kept restarting.
Resolution
Druid Config Brokers are removed in 420. The next release will no longer have Druid Broker pods.
Workaround: Add an additional Druid Config Broker replica to handle the scale.
Run the following commands on the NSX Manager as root:
export KUBE_EDITOR=vim.tiny
napp-k edit deployment druid-config-broker
In the editor, find "spec->replicas", increase the replica count by 1, then save and exit to apply the change.
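The same replica change can also be made without opening an editor. The following is a sketch under two assumptions: that napp-k passes standard kubectl subcommands (get with -o jsonpath, scale) through to the platform namespace, and that the function is defined in the same root shell where napp-k is available. The function name add_broker_replica is hypothetical.

```shell
# Hypothetical helper: read the current replica count of the
# druid-config-broker deployment and scale it up by one.
# Assumes napp-k accepts the standard kubectl "get" and "scale" subcommands.
add_broker_replica() {
  current=$(napp-k get deployment druid-config-broker -o jsonpath='{.spec.replicas}') || return 1
  # Refuse to continue if the value read back is not a plain number.
  case "$current" in
    ''|*[!0-9]*) echo "unexpected replica count: '$current'" >&2; return 1 ;;
  esac
  napp-k scale deployment druid-config-broker --replicas="$((current + 1))"
}

# Example use on the NSX Manager as root (not executed here):
#   add_broker_replica
#   napp-k get pods | grep druid-config-broker
```

Reading the count first keeps the change to exactly one additional replica, matching the manual edit.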
If the issue persists even after adding the replica, check the logs of the Druid Config Broker for lines similar to the following:
2024-03-13T20:08:33,410 WARN [main] org.apache.druid.discovery.DruidLeaderClient - Request[https://192.168.4.10:8281/druid/coordinator/v1/lookups/config/__default?detailed=true] received a 503 Service Unavailable response. Attempt 4/5
2024-03-13T20:08:33,416 WARN [main] org.apache.druid.discovery.DruidLeaderClient - Request[https://192.168.4.10:8281/druid/coordinator/v1/lookups/config/__default?detailed=true] received a 503 Service Unavailable response. Attempt 5/5
These warnings mean that the old Druid coordinator pod was in an error state due to a SysMonitor issue and the new Druid broker pod was not able to connect to it. Delete the Druid coordinator pod and retry; this resolves the issue.
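The log check described above can be sketched as a small filter. The function name count_coordinator_503s is hypothetical, and the pod names in the usage comments are placeholders to be replaced with the names reported by napp-k get pods; the assumption that the coordinator pod name contains "druid-coordinator" is not confirmed by this article.

```shell
# Hypothetical helper: count DruidLeaderClient warnings that report a
# 503 Service Unavailable response from the coordinator.
# Reads broker log text on stdin and prints the number of matching lines.
count_coordinator_503s() {
  grep -c 'DruidLeaderClient.*received a 503 Service Unavailable'
}

# Example use on the NSX Manager as root (not executed here; pod names are placeholders):
#   napp-k logs <druid-config-broker-pod> | count_coordinator_503s
#   napp-k get pods | grep druid-coordinator     # assumed naming of the coordinator pod
#   napp-k delete pod <druid-coordinator-pod>
```

A non-zero count suggests the broker could not reach the coordinator, which is the condition the coordinator pod deletion addresses.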
Additional Information
Impact/Risks: On large deployments with many NSX groups and virtual machines, the Druid configuration broker pod (druid-config-broker-xxxx) sometimes keeps restarting. This has been observed when the scale reaches 5000 groups, 5000 VMs, 30 physical servers, and 270 hosts.