Cloud Controller fails to start due to a race condition between nginx_cc and ccng_monit_http

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

A BOSH deployment fails because of a race condition, with the following error indicating that the cloud_controller_ng, ccng_monit_http_healthcheck, and nginx_cc jobs could not start. Because this is a race condition, some of these jobs may report success; however, at least one or two of them should fail for the symptoms to match this knowledge article.

Task 184 | 04:48:43 | L starting jobs: cloud_controller/685460c3-a8c8-4c8c-ae64-70292f62b382 (0) (canary) (00:07:10)
                    L Error: 'cloud_controller/685460c3-a8c8-4c8c-ae64-70292f62b382 (0)' is not running after update. Review logs for failed jobs: cloud_controller_ng, ccng_monit_http_healthcheck, nginx_cc
Task 184 | 04:53:43 | Error: 'cloud_controller/685460c3-a8c8-4c8c-ae64-70292f62b382 (0)' is not running after update. Review logs for failed jobs: cloud_controller_ng, ccng_monit_http_healthcheck, nginx_cc

The log file /var/vcap/sys/log/cloud_controller_ng/ccng_monit_http_healthcheck.stdout.log will show the following pattern. The line “Will restart CC over on repeated failures” is normal and is expected whenever the ccng monit healthcheck starts up. The line reporting that curl failed with exit code 7 is the symptom that matches this bug.

2024-01-05 04:48:44.677964337+00:00 Will restart CC over on repeated failures
2024-01-05 04:48:44.686089362+00:00 ccng_monit_http_healthcheck failed to curl <https://10.225.58.72:9024/healthz>: exit code 7
2024-01-05 04:48:44.687590286+00:00 :: Healthcheck failed consistently, restarting CC

Exit code 7 indicates that the curl command could not reach the local nginx process (the nginx_cc job) because the connection to port 9024 was refused. This happens when nginx has not yet started listening on port 9024. Monit is responsible for starting all the jobs; if the nginx_cc job is started 3 or 4 seconds after the ccng_monit_http_healthcheck job, the healthcheck fails and restarts cloud_controller_ng as well as the nginx_cc job. This cycle may continue indefinitely.
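
A log can be checked for this pattern with a small shell helper. This is a minimal sketch, not part of the product: the function name `check_cc_healthcheck_log` is hypothetical, and the default path is the log file named above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a healthcheck log shows the symptom
# described in this article (curl exit code 7 plus a CC restart).
check_cc_healthcheck_log() {
  local log_file="${1:-/var/vcap/sys/log/cloud_controller_ng/ccng_monit_http_healthcheck.stdout.log}"

  # Both lines must be present to match the symptoms.
  if grep -Eq 'failed to curl .*: exit code 7([^0-9]|$)' "$log_file" &&
     grep -q 'Healthcheck failed consistently, restarting CC' "$log_file"; then
    echo "match"
  else
    echo "no match"
  fi
}
```

Calling `check_cc_healthcheck_log` with no argument reads the default log path on the cloud controller VM; a different path can be passed as the first argument.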

Resolution

The fix for this issue is in this commit, which ignores connection refused errors when connecting to nginx and continues to retry the healthcheck. The fix can be identified in the capi-release release notes as “CC healthcheck tolerates nginx unavailability”.
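
The behavior change can be sketched in shell as follows. This is a simplified illustration, not the actual capi-release script: the function name `handle_curl_status` is hypothetical, and the logic is reconstructed from the manual patch described in this article, which logs the failure and only gives up when the curl status is not 7.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the real ccng_monit_http_healthcheck script.
# handle_curl_status is a hypothetical name used for this example.
# Returns 0 when the healthcheck should keep retrying, or the curl status
# when the failure should count as a real healthcheck failure.
handle_curl_status() {
  local status="$1"
  local url="$2"

  if [[ "$status" -eq 0 ]]; then
    return 0
  fi

  echo "$(date --rfc-3339=ns) ccng_monit_http_healthcheck failed to curl <${url}>: exit code $status"

  # Exit code 7 means curl could not connect (nginx is not listening yet),
  # so tolerate it and let the caller retry.
  if [[ "$status" != 7 ]]; then
    return "$status"
  fi
  return 0
}
```

With this logic, a connection refused error is still logged but no longer triggers a restart on its own, while any other curl failure is handled as before.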

TAS fixed versions

2.11.51 - capi-release 1.109.33
2.13.33 - capi-release 1.133.16
4.0.15 - capi-release 1.169.0
5.0.5 - capi-release 1.169.0

To work around this error, you can apply this patch manually to the file /var/vcap/jobs/cloud_controller_ng/bin/ccng_monit_http_healthcheck on the cloud controller VM after the BOSH deployment has started pre-start and before it fails during “starting jobs”. After applying the patch you will still see exit code 7 errors in the ccng_monit_http_healthcheck.stdout.log file mentioned above, but you should no longer see "Healthcheck failed consistently, restarting CC".

One method to patch the cloud controller manually is to use the sed command below.

BOSH SSH into the cloud controller VM while it is in the "starting jobs" state.

Get root access:

```
sudo su -
```

Run the sed command:

sed -E -i'.backup' '36s/echo \$\(date --rfc-3339=ns\) "ccng_monit_http_healthcheck failed to curl <\$\{URL\}>: exit code \$status"/echo "\$(date --rfc-3339=ns) ccng_monit_http_healthcheck failed to curl <\${URL}>: exit code \$status"\n    if [[ $status != 7 ]] ; then\n      exit $status\n    fi/;37s/^ +exit \$status$//' /var/vcap/jobs/cloud_controller_ng/bin/ccng_monit_http_healthcheck

The above sed command creates a /var/vcap/jobs/cloud_controller_ng/bin/ccng_monit_http_healthcheck.backup file that can be used to restore the original script if there are any copy/paste errors. NOTE: Running this command more than once will overwrite the backup file.
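
Restoring from the backup can be sketched as below. This is a minimal example under stated assumptions: the function name `restore_healthcheck` is hypothetical, the default path is the one used by the sed command above, and the `.backup` suffix matches its `-i'.backup'` option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show what the patch changed, then restore the
# original script from the .backup copy created by sed -i'.backup'.
restore_healthcheck() {
  local target="${1:-/var/vcap/jobs/cloud_controller_ng/bin/ccng_monit_http_healthcheck}"
  local backup="${target}.backup"

  if [[ ! -f "$backup" ]]; then
    echo "no backup found at ${backup}" >&2
    return 1
  fi

  # diff exits non-zero when the files differ, which is expected here.
  diff -u "$backup" "$target" || true

  # -p preserves the file mode so the script stays executable.
  cp -p "$backup" "$target"
}
```

Reviewing the `diff` output before relying on the patched file is a quick way to confirm that only the intended lines changed.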