The Cloud server goes into maintenance mode at random, stopping users from logging in.
This may occur daily around the same time, or multiple times a day.
The UI comes back up after autohealing, generally within 10 minutes.
A message similar to the one below appears in /var/log/cb/enterprise/enterprise.log:
2018-08-30 13:40:24 [17053] <warning> cb.enterprise.tasks.server_health_monitor.indicators.pg_stats - The following 1 queries have been running for over 60 seconds:
2018-08-30 13:40:24 [17053] <warning> cb.enterprise.tasks.server_health_monitor.indicators.pg_stats - 1) [query -> UPDATE sensor_registrations SET build_id = $1,event_log_flush_time = $2,group_id = $3,id = $4,liveresponse_init = $5,network_isolation_enabled = $6,restart_queued = $7,uninstall = $8 WHERE id = $9], [pid -> 13428], [usename -> cb], [application_name -> PostgreSQL JDBC Driver], [client_addr -> 127.0.0.1], [client_hostname -> None], [client_port -> 53699], [query_start -> 2018-08-30 13:39:20], [query_duration -> 0:01:03]
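To confirm the symptom on a server, the log can be scanned for these pg_stats warnings. The following is a minimal Python sketch, assuming the entries follow the format of the sample above; the function name find_slow_queries is hypothetical and is not part of the product.

```python
import re

# Matches the per-query detail entry written by the pg_stats health indicator.
# The layout is assumed from the sample log message in this article.
DETAIL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\d+\] <warning> "
    r"cb\.enterprise\.tasks\.server_health_monitor\.indicators\.pg_stats - "
    r"\d+\) \[query -> (?P<query>.*?)\], \[pid -> (?P<pid>\d+)\]"
    r".*?\[query_duration -> (?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})\]",
    re.MULTILINE | re.DOTALL,
)


def find_slow_queries(log_text):
    """Return one dict per long-running query reported in log_text."""
    results = []
    for match in DETAIL_RE.finditer(log_text):
        seconds = (
            int(match.group("h")) * 3600
            + int(match.group("m")) * 60
            + int(match.group("s"))
        )
        results.append(
            {
                "logged_at": match.group("ts"),
                "pid": int(match.group("pid")),
                # Collapse whitespace in case the entry was wrapped.
                "query": " ".join(match.group("query").split()),
                "duration_seconds": seconds,
            }
        )
    return results


# Example usage (path taken from this article):
# with open("/var/log/cb/enterprise/enterprise.log") as handle:
#     for entry in find_slow_queries(handle.read()):
#         print(entry["logged_at"], entry["duration_seconds"], entry["query"][:60])
```

Repeated hits on the UPDATE sensor_registrations statement, clustered around the times users were locked out, would be consistent with the issue described here.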
Environment
Hosted EDR: 6.2.3 and Higher
Cause
Sensor check-ins that include the newly added health checks cause performance issues, seen as high CPU spikes. The spikes cause the UI to fail, and an autoheal then brings it back up.
Resolution
Extend the sensor check-in delay to 240 seconds by setting the following in /etc/cb/cb.conf:
MinSensorCheckinDelaySec=240
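The edit can also be scripted. The following is a minimal Python sketch, assuming cb.conf uses plain key=value lines as shown above; the helper name set_conf_value is hypothetical. It replaces an existing uncommented setting or appends the line if none is present.

```python
import re


def set_conf_value(conf_text, key, value):
    """Return conf_text with key=value set.

    An existing uncommented "key=..." line is replaced in place;
    otherwise the setting is appended as a new line.
    """
    pattern = re.compile(
        rf"^[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE
    )
    new_line = f"{key}={value}"
    if pattern.search(conf_text):
        # A callable replacement avoids backslash processing in the value.
        return pattern.sub(lambda _: new_line, conf_text, count=1)
    separator = "" if conf_text == "" or conf_text.endswith("\n") else "\n"
    return f"{conf_text}{separator}{new_line}\n"


# Example usage (back up the file first):
# path = "/etc/cb/cb.conf"
# with open(path) as handle:
#     original = handle.read()
# updated = set_conf_value(original, "MinSensorCheckinDelaySec", "240")
# with open(path, "w") as handle:
#     handle.write(updated)
```

Working on the file contents as a string keeps the change easy to review or diff before it is written back.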
Additional Information
This has only been seen in Cloud environments so far.
The parameter lengthens the interval at which sensors check in, reducing the stress on the server.
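As a rough illustration of why a longer delay helps, the sketch below estimates an upper bound on check-ins per second, assuming each sensor checks in at most once per MinSensorCheckinDelaySec. The sensor count and the 30-second comparison value are hypothetical figures chosen for the example, not values from this article.

```python
def max_checkins_per_second(sensor_count, min_delay_sec):
    """Upper bound on the check-in rate if every sensor checks in
    no more often than once per min_delay_sec seconds."""
    if min_delay_sec <= 0:
        raise ValueError("min_delay_sec must be positive")
    return sensor_count / min_delay_sec


# Hypothetical fleet of 12,000 sensors.
sensors = 12000
before = max_checkins_per_second(sensors, 30)   # hypothetical shorter delay
after = max_checkins_per_second(sensors, 240)   # value from the Resolution
print(f"30 s delay:  up to {before:.0f} check-ins/s")
print(f"240 s delay: up to {after:.0f} check-ins/s")
```

Under this simple model, the peak check-in load on the server falls in proportion to the increase in the delay.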