Some TAS instances would go to failing state after BBR is run


Article ID: 298084


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

After BBR is started, some TAS instances may appear to be failing; they recover automatically after a while. Messages similar to the following can be seen in the instance logs:

clock_global/cloud_controller_clock
D, [2021-03-02T01:03:35.508000 #7] DEBUG -- : Got TERM signal
I, [2021-03-02T01:03:35.508062 #7]  INFO -- : Gracefully shutting down

cloud_controller/cloud_controller_ng
{"timestamp":1612832801.3210309,"message":"Caught signal TERM","log_level":"warn","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":88,"method":"block (3 levels) in trap_signals"}
{"timestamp":1612832801.3213882,"message":"Stopping Thin Server.","log_level":"info","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":185,"method":"stop_thin_server"}
{"timestamp":1612832801.3218896,"message":"Stopping EventMachine","log_level":"info","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":104,"method":"stop!"}

diego_database/policy_server
{"timestamp":"2021-03-02T01:03:34.666198936Z","level":"info","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.exited","data":{}}
{"timestamp":"2021-03-02T01:31:22.435053987Z","level":"info","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.getting db connection","data":{}}
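The TERM entries above indicate a graceful, lock-driven shutdown rather than a crash. A minimal sketch of checking a log line for these markers (the log line is abbreviated from the cloud_controller_ng sample above; on a real VM, the logs to search live under /var/vcap/sys/log/<job>/):

```shell
# Abbreviated cloud_controller_ng log line taken from the sample above.
LOG_LINE='{"timestamp":1612832801.3210309,"message":"Caught signal TERM","log_level":"warn","source":"cc.runner"}'

# A TERM signal logged during the backup window means BBR locked the job;
# it does not indicate that the process crashed.
echo "$LOG_LINE" | grep -q 'Caught signal TERM' && echo "graceful shutdown (expected during BBR)"
```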


Environment

Product Version: 2.9

Resolution

As mentioned in the BBR documentation, a release job can include a pre-backup-lock script that stops any processes that could make changes to the components being backed up. It is therefore expected behavior that jobs implementing a pre-backup-lock script are stopped by BBR during the backup, which BOSH in turn reports as failing.
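To see which jobs on a given VM carry a pre-backup-lock script, you can look under /var/vcap/jobs, the standard location for BBR scripts. The sketch below simulates that directory layout in a temporary directory so it can run anywhere; on a real instance (reached via bosh ssh), run the find against /var/vcap/jobs directly. The job name shown is only an example.

```shell
# Simulate the BBR script layout of a TAS VM in a temp directory.
# On a real instance, skip this setup and run the find on /var/vcap/jobs.
TMP=$(mktemp -d)
mkdir -p "$TMP/var/vcap/jobs/cloud_controller_ng/bin/bbr"
touch "$TMP/var/vcap/jobs/cloud_controller_ng/bin/bbr/pre-backup-lock"

# List every job on the VM that BBR will lock (stop or put in read-only mode).
find "$TMP/var/vcap/jobs" -name pre-backup-lock -type f
```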

The table below lists the TAS instances and the jobs that are stopped by their pre-backup-lock scripts:
 
Instance                  Job(s)
diego_brain               tps
diego_database            policy_server
cloud_controller          cloud_controller_worker_local_1
                          cloud_controller_worker_local_2
                          cloud_controller_ng
                          ccng_monit_http_healthcheck
cloud_controller_worker   cloud_controller_worker_1
clock_global              cc_deployment_updater
                          cloud_controller_clock
The following table lists the instances and jobs that are put into read-only mode by their pre-backup-lock scripts:
 
Instance            Job
credhub             credhub
uaa                 uaa
cloud_controller    routing-api
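Once the backup completes, the locked jobs are restarted automatically. A quick way to confirm everything has recovered is to ask BOSH for any still-failing processes. This is a sketch: the deployment name "cf" is an example, so substitute your TAS deployment name, and run it from a host with the bosh CLI targeted at your director (e.g. the Ops Manager VM or a jumpbox).

```shell
# After the backup, an empty result from --failing means all jobs recovered.
# "cf" is an example deployment name; substitute your TAS deployment.
if command -v bosh >/dev/null 2>&1; then
  bosh -d cf instances --ps --failing
else
  echo "bosh CLI not found; run from your Ops Manager VM or jumpbox"
fi
```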