Some TAS instances would go to failing state after BBR is run


Article ID: 298084


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

After BBR is started, some TAS instances may appear to be failing; they recover automatically after a while. Messages similar to the following can be seen in the instance logs:

clock_global/cloud_controller_clock
D, [2021-03-02T01:03:35.508000 #7] DEBUG -- : Got TERM signal
I, [2021-03-02T01:03:35.508062 #7]  INFO -- : Gracefully shutting down

cloud_controller/cloud_controller_ng
{"timestamp":1612832801.3210309,"message":"Caught signal TERM","log_level":"warn","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":88,"method":"block (3 levels) in trap_signals"}
{"timestamp":1612832801.3213882,"message":"Stopping Thin Server.","log_level":"info","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":185,"method":"stop_thin_server"}
{"timestamp":1612832801.3218896,"message":"Stopping EventMachine","log_level":"info","source":"cc.runner","data":{},"thread_id":47271277168140,"fiber_id":47271324258700,"process_id":9,"file":"/var/vcap/data/packages/cloud_controller_ng/1bb2ad4b3260ad72bb2d3d348d9ce20a0d65fb0d/cloud_controller_ng/lib/cloud_controller/runner.rb","lineno":104,"method":"stop!"}

diego_database/policy_server
{"timestamp":"2021-03-02T01:03:34.666198936Z","level":"info","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.exited","data":{}}
{"timestamp":"2021-03-02T01:31:22.435053987Z","level":"info","source":"cfnetworking.policy-server","message":"cfnetworking.policy-server.getting db connection","data":{}}
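The TERM entries above indicate a graceful, lock-driven shutdown rather than a crash. A minimal sketch of checking a log line for these markers (the log line is abbreviated from the cloud_controller_ng sample above; on a real VM, the logs to search live under /var/vcap/sys/log/<job>/):

```shell
# Abbreviated cloud_controller_ng log line taken from the sample above.
LOG_LINE='{"timestamp":1612832801.3210309,"message":"Caught signal TERM","log_level":"warn","source":"cc.runner"}'

# A TERM signal logged during the backup window means BBR locked the job;
# it does not indicate that the process crashed.
echo "$LOG_LINE" | grep -q 'Caught signal TERM' && echo "graceful shutdown (expected during BBR)"
```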


Environment

Product Version: 2.9

Resolution

As mentioned in the BBR documentation, a release job can include a pre-backup-lock script that stops any processes that could make changes to the components being backed up. It is therefore expected behavior that jobs implementing a pre-backup-lock script are stopped by BBR during the backup, which BOSH in turn reports as failing.
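To see which jobs on a given VM carry a pre-backup-lock script, you can look under /var/vcap/jobs, the standard location for BBR scripts. The sketch below simulates that directory layout in a temporary directory so it can run anywhere; on a real instance (reached via bosh ssh), run the find against /var/vcap/jobs directly. The job name shown is only an example.

```shell
# Simulate the BBR script layout of a TAS VM in a temp directory.
# On a real instance, skip this setup and run the find on /var/vcap/jobs.
TMP=$(mktemp -d)
mkdir -p "$TMP/var/vcap/jobs/cloud_controller_ng/bin/bbr"
touch "$TMP/var/vcap/jobs/cloud_controller_ng/bin/bbr/pre-backup-lock"

# List every job on the VM that BBR will lock (stop or put in read-only mode).
find "$TMP/var/vcap/jobs" -name pre-backup-lock -type f
```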

The table below lists the TAS instances and the jobs that are stopped by their pre-backup-lock scripts:
 
Instance                  Job(s)
diego_brain               tps
diego_database            policy_server
cloud_controller          cloud_controller_worker_local_1
                          cloud_controller_worker_local_2
                          cloud_controller_ng
                          ccng_monit_http_healthcheck
cloud_controller_worker   cloud_controller_worker_1
clock_global              cc_deployment_updater
                          cloud_controller_clock
The following table lists the instances and jobs that are put into read-only mode by their pre-backup-lock scripts:
 
Instance            Job
credhub             credhub
uaa                 uaa
cloud_controller    routing-api
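Once the backup completes, the locked jobs are restarted automatically. A quick way to confirm everything has recovered is to ask BOSH for any still-failing processes. This is a sketch: the deployment name "cf" is an example, so substitute your TAS deployment name, and run it from a host with the bosh CLI targeted at your director (e.g. the Ops Manager VM or a jumpbox).

```shell
# After the backup, an empty result from --failing means all jobs recovered.
# "cf" is an example deployment name; substitute your TAS deployment.
if command -v bosh >/dev/null 2>&1; then
  bosh -d cf instances --ps --failing
else
  echo "bosh CLI not found; run from your Ops Manager VM or jumpbox"
fi
```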