Service Start Stalls Prior to cb-supervisord Causing Timeout

Products

Carbon Black EDR (formerly Cb Response)

Issue/Introduction

cbcluster or cb-enterprise service startup times out. The start sequence stalls before the cb-enterprise services begin starting.

Symptoms include:

- On clustered servers only, minions report in /var/log/cb/datagrid/debug.log that postgres is not reachable.
- Services hang for a few minutes before the cb-supervisord service starts.
- Running "journalctl -fexu cb-enterprise" in a separate terminal session shows the service hanging at "session closed for user cb".
```
runuser[316641]: pam_unix(runuser:session): session closed for user cb
```
- The startup script header and the start of cb-supervisord are logged a few minutes apart.
```
cat /var/log/messages | grep -E 'Carbon Black EDR is a surveillance|Starting cb-supervisord'
```
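The gap between those two log lines can be expressed in seconds. The following is a minimal sketch, assuming GNU date and the traditional syslog timestamp format (for example `Jan  5 10:00:00`) in the first 15 characters of each line. `gap_seconds` is a hypothetical helper name, not part of the product.

```shell
# Hypothetical helper: print the number of seconds between two timestamps.
# Assumes GNU date, which can parse strings such as "Jan  5 10:00:00".
gap_seconds() {
    local start end
    start=$(date -d "$1" +%s) || return 1
    end=$(date -d "$2" +%s) || return 1
    echo $(( end - start ))
}

# Example usage (assumes the first 15 characters of each line are the timestamp):
#   header=$(grep 'Carbon Black EDR is a surveillance' /var/log/messages | tail -n 1 | cut -c1-15)
#   supervisord=$(grep 'Starting cb-supervisord' /var/log/messages | tail -n 1 | cut -c1-15)
#   gap_seconds "$header" "$supervisord"
```

A result of a few hundred seconds would match the symptom described here. If the log uses a different timestamp format, the `cut` range needs adjusting.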

Environment

Carbon Black EDR Server: All Versions

Cause

One of the pre-startup check scripts is taking too long to complete before services can attempt startup.

Resolution

Run the following loop to time each pre-startup action and find the one that takes the longest.

for i in 'ServerStatusCheck' 'Cleanup' 'UpdateChkConfig' 'UpdateEtcHosts' 'CopyErlangCookie' 'InitModulestore' 'InitLoggers' 'VersionCheck' 'SolrCoreSetup' 'HandleSolrSSLKeyCert' 'HandleSolrSSLCert' 'GenerateRedisConfig' 'HandleRedisSSLCerts' 'PopulateNginxProps' 'ResetFilePermissions' 'GenerateCrontab' 'EnableRabbitMqPlugins' 'UpdateSELinux' 'GenerateRabbitMqClusterConfig' 'UpdateSysctl'; do echo "====Executing $i====" && time /usr/share/cb/virtualenv/bin/python -m cb.maintenance.cbstartup.main --single-action="$i"; done
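
When many actions are timed, it can be easier to read a list sorted by duration. The following is a minimal sketch of that idea, assuming bash and GNU date (for `%N`). `run_action`, `time_actions`, and the `CB_STARTUP_CMD` override are hypothetical names introduced here for illustration; only the python invocation comes from the command above.

```shell
# Hypothetical wrapper around the pre-startup invocation shown above.
# CB_STARTUP_CMD is an assumed override so the loop can be tried with a stand-in command.
run_action() {
    if [ -n "${CB_STARTUP_CMD:-}" ]; then
        $CB_STARTUP_CMD "$1"
    else
        /usr/share/cb/virtualenv/bin/python -m cb.maintenance.cbstartup.main --single-action="$1"
    fi
}

# Run each named action and print "<milliseconds><TAB><action>", slowest first.
time_actions() {
    local action start end
    for action in "$@"; do
        start=$(date +%s%N)
        # Send the action's own output to stderr so it does not mix with the timing list.
        run_action "$action" >&2
        end=$(date +%s%N)
        printf '%d\t%s\n' $(( (end - start) / 1000000 )) "$action"
    done | sort -rn
}

# Example usage:
#   time_actions UpdateSELinux ResetFilePermissions UpdateSysctl
```

The first line of the output names the slowest action, which is the one to investigate further.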

Most of these scripts are simple checks. Two of them have been reported to cause issues in some environments.
1. UpdateSELinux
  1. The following command should complete in a few seconds (roughly 2-6 seconds).
```
time semanage port -a -t rabbitmq_port_t -p tcp 5004
```
  2. There are two potential causes when it takes longer.
    1. Many custom SELinux policies have been added beyond those shipped with the base OS and the product. An admin should review all policies.
    2. Disk speed is insufficient for the semanage call.
      1. For VMs, make sure resources are not shared.
      2. In cloud environments such as AWS, the instance tier chosen may be the limiting factor. For example, an AWS T series instance has lower EBS bandwidth than an M series instance, and EBS bandwidth determines how fast the instance can access disk storage.
2. ResetFilePermissions
  1. This action is slow when one or more of the paths it processes contains too many files. Use the following command to count files in the common locations.
```
for i in '/var/cb/data/' '/var/cb/data/live-response/' '/var/cb/data/solr/' '/var/cb/data/modulestore/' '/var/log/cb/'; do echo -e "$(find $i -type f | wc -l) \t$i"; done
```
  2. To reduce the file count, see "How to Clear Disk Space Safely on the EDR Server".
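
Once one of the paths above shows a large count, it may help to see which directories beneath it hold the files. The following is a minimal sketch, assuming GNU find (for `-printf`) and directory names without newlines. `top_dirs_by_file_count` is a hypothetical helper name.

```shell
# Hypothetical helper: list the directories under a path that hold the most files.
# Prints "<file count> <directory>", largest first.
top_dirs_by_file_count() {
    local root=$1 limit=${2:-10}
    # %h prints the directory containing each file; uniq -c then counts files per directory.
    find "$root" -type f -printf '%h\n' | sort | uniq -c | sort -rn | head -n "$limit"
}

# Example usage:
#   top_dirs_by_file_count /var/cb/data/ 20
```

On a path with a very large number of files this walk can itself take a while, so it is best run against one suspect path at a time.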
Open a support case with the following information if further assistance is needed.
1. An strace of the pre-startup action that takes the longest. The command below uses ResetFilePermissions as an example.
```
strace -s 500 -I 1 -fr -o startup_stage.strace /usr/share/cb/virtualenv/bin/python -m cb.maintenance.cbstartup.main --single-action=ResetFilePermissions
```
2. Server diagnostic logs (CBDiags). See "Generate Server Diagnostic Logs for On-Prem".
3. Attach the startup_stage.strace and cbdiag zip to the case.
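
Before attaching the trace, a quick look at its largest time gaps can indicate where the action spends its time. The following is a minimal sketch, assuming the output layout produced by the `-f`, `-r`, and `-o` options used above: a PID in the first column and the relative timestamp in the second. `largest_strace_gaps` is a hypothetical helper name.

```shell
# Hypothetical helper: show the trace lines with the largest relative timestamps.
# Assumes column 1 is the PID and column 2 is the relative timestamp written by strace -r.
largest_strace_gaps() {
    local file=$1 limit=${2:-20}
    sort -k2,2nr "$file" | head -n "$limit"
}

# Example usage:
#   largest_strace_gaps startup_stage.strace 10
```

Because the relative timestamp is recorded on entry to each system call, a large value generally reflects time spent before that call began, so the line preceding it in the trace is usually the one to examine.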

Additional Information

UpdateSELinux: Uses semanage to update SELinux policies for the EDR services. semanage reads all policies, changes them in memory, and then writes them out to temp files.
ResetFilePermissions: Walks all files and updates their permissions to avoid permission-related startup issues and problems accessing the database and log files.
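
Because semanage reads every policy on each call, the review of custom SELinux policies mentioned in the Resolution can start from a summary of what is installed. The following is a minimal sketch, assuming `semodule` and `semanage` are available and the commands are run as root. `review_selinux_customizations` is a hypothetical helper name.

```shell
# Hypothetical helper: summarize SELinux policy state for review.
review_selinux_customizations() {
    # semodule -l lists installed policy modules, including those shipped with the OS.
    echo "Installed policy modules: $(( $(semodule -l | wc -l) ))"
    # semanage port -l -C lists only locally customized port definitions.
    echo "Locally customized port definitions:"
    semanage port -l -C
}

# Example usage:
#   review_selinux_customizations
```

The module count includes the distribution's own modules, so it is most useful when compared against a host of the same OS version where the semanage call completes quickly.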