When pushing an application to the TAS foundation, staging would intermittently fail. The failure always occurred while uploading the droplet, as shown below.
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading droplet, build artifacts cache...
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading build artifacts cache...
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading droplet...
2023-05-23T10:24:36.05+0800 [STG/0] OUT Uploaded build artifacts cache (129B)
2023-05-23T10:24:36.53+0800 [API/2] OUT Creating droplet for app with guid 69c6####-####-####-####-####9e3c
2023-05-23T10:24:38.09+0800 [API/0] OUT Creating droplet for app with guid 69c6####-####-####-####-####9e3c
2023-05-23T10:24:39.64+0800 [STG/0] ERR Failed to upload payload for droplet
2023-05-23T10:24:39.65+0800 [STG/0] ERR Uploading failed
2023-05-23T10:24:39.86+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba stopping instance 3a02####-####-####-####-####9be6
2023-05-23T10:24:39.86+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba destroying container for instance 3a02####-####-####-####-####9be6
2023-05-23T10:24:39.91+0800 [API/2] ERR Failed to stage build: staging failed
2023-05-23T10:24:40.14+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba successfully destroyed container for instance 3a02####-####-####-####-####9be6
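For reference, the staging output above can be captured while reproducing the failure with the cf CLI; the app name below is a placeholder:

$ cf push <app-name>
$ cf logs <app-name> --recent | grep -E "STG|API"

The second command filters the recent log stream down to the staging (STG) and Cloud Controller (API) entries shown above.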
The rep job on the diego_cell instance reported the error "Upload failed: Status code 500".
{...,"level":"error","source":"rep","message":"rep.URLUploader.failed-uploading","data":{"attempt":1,"error":"Upload failed: Status code 500","fileLocation":"/var/vcap/data/rep/tmp/executor-work/compressed3308575130","session":"2"}}
Same error "Upload failed: Status code 500" was also seen cloud_controller logs
{...,"message":"diego.staging.completion-controller.staging-failed","log_level":"error","source":"cc.api","data":{"request_guid":"4175####-####-####-####-####909d","task_guid":"db7b297e-c5ed-4cd7-94e0-acca58ad9f46","failed":true,"failure_reason":"Upload failed: Status code 500","result":"","created_at":1684719607376265297},"thread_id":65260,"fiber_id":65280,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/16d38133417d769a5a8d280882646b0cd91e1ce5/cloud_controller_ng/app/controllers/internal/staging_completion_controller.rb","lineno":108,"method":"parse_bbs_task_callback"}
On this particular cloud_controller instance, the ccng_monit_http_healthcheck job had tried to restart the cloud_controller job because of repeated health check failures.
2023-05-17 16:08:58.766594523+00:00 Will restart CC over on repeated failures
2023-05-17 16:08:58.771876535+00:00 ccng_monit_http_healthcheck failed to curl https://###.###.###.###:9024/healthz: exit code 7
2023-05-17 16:08:58.772963058+00:00 :: Healthcheck failed consistently, restarting CC
2023-05-17 16:09:09.654327192+00:00 Will restart CC over on repeated failures
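curl exit code 7 means the connection could not be established at all. The same health check can be reproduced manually against the affected instance; the address below is a placeholder for the instance IP seen in the log:

$ curl -kv https://<api-instance-ip>:9024/healthz

A healthy Cloud Controller answers this request immediately; on the stuck instance it fails in the same way as the ccng_monit_http_healthcheck script.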
However, after the cloud_controller job was restarted, ccng_monit_http_healthcheck kept reporting that the cloud_controller.sock file was missing.
==> nginx-error.log
2023/05/17 16:09:09 [crit] 8#0: *1 connect() to unix:/var/vcap/data/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream, client: ###.###.###.###, server: _, request: "GET /healthz HTTP/1.1", upstream: "http://unix:/var/vcap/data/cloud_controller_ng/cloud_controller.sock:/healthz", host: "###.###.###.###:9024"
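nginx proxies /healthz to the Cloud Controller through the unix socket named in the error, so a missing socket typically means the Ruby process never finished coming up. This can be checked on the affected VM, assuming the default data path:

# ls -l /var/vcap/data/cloud_controller_ng/cloud_controller.sock
# monit summary

Note that monit summary may still show cloud_controller_ng as running even though the socket was never created, which matches the stuck state described below.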
Also, after the restart the cloud_controller process produced no further logs about handling client requests, apart from a MySQL connection error.
{"timestamp":"2023-05-17T16:09:59.724590545Z","message":"2023-05-17T16:09:59+0000: [Worker(cc_api_worker.cloud_controller.3.2)] Error while reserving job: Mysql2::Error::ConnectionError: MySQL server has gone away","log_level":"info","source":"cc-worker","data":{},"thread_id":64420,"fiber_id":64480,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/16d38133417d769a5a8d280882646b0cd91e1ce5/gem_home/ruby/2.7.0/gems/delayed_job-4.1.8/lib/delayed/worker.rb","lineno":285,"method":"say"}
The route_registrar job also reported failures connecting to cloud_controller and unregistered the route.
==> route_registrar.stdout.log
{"timestamp":"2023-05-17T16:08:54.086332536Z","level":"info","source":"Route Registrar","message":"Route Registrar.Script exited with error","data":{"error":"exit status 7","script":"/var/vcap/jobs/c
loud_controller_ng/bin/cloud_controller_ng_health_check","stderr":" % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload To
tal Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 9024: Connection refused\n","stdout":"2023-05
-17 16:08:54.075543263+00:00 cloud_controller_ng_health_check: route-registrar about to curl https://localhost:9024/healthz\n2023-05-17 16:08:54.085880084+00:00 Failed to hit https://localhost:9024/h
ealthz\n"}}
{"timestamp":"2023-05-17T16:08:54.086377049Z","level":"info","source":"Route Registrar","message":"Route Registrar.healthchecker returned unhealthy for route","data":{"route":{"Type":"","Name":"api",
"Port":null,"TLSPort":9024,"Tags":{"component":"CloudController"},"URIs":["api.sys.example.com"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval
":20000000000,"HealthCheck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-contro
ller-ng.service.cf.internal"}}}
{"timestamp":"2023-05-17T16:08:54.086397738Z","level":"info","source":"Route Registrar","message":"Route Registrar.Unregistering route","data":{"route":{"Type":"","Name":"api","Port":null,"TLSPort":9
024,"Tags":{"component":"CloudController"},"URIs":["api.sys.example.com"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCh
eck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-controller-ng.service.cf.inte
rnal"}}}
All of these symptoms indicated that this particular cloud_controller instance had become stuck at some point after the restart and was rejecting every incoming request. As a result, this cloud_controller returned a 500 status code when the rep tried to upload the droplet payload to it.
Product Version: 2.11
Diego reports staging status to cloud_controller by calling its API over the internal route, which is resolved by BOSH DNS. While route_registrar had unregistered the external api route, the failing cloud_controller instance was never removed from this internal routing, so Diego kept sending requests to it and the intermittent staging failures occurred.
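The internal route is the BOSH DNS name seen in the route_registrar output above (cloud-controller-ng.service.cf.internal). Whether the stuck instance is still being handed out can be checked from a diego_cell VM; a minimal sketch, assuming dig is available on the cell:

# dig +short cloud-controller-ng.service.cf.internal

If the IP of the stuck api instance still appears in the answer, Diego keeps sending droplet uploads and staging callbacks to it, which is the failure mode described here.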
The temporary workaround is to restart the cloud_controller_ng job on the problematic instance.
# monit restart cloud_controller_ng
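After the restart, it is worth confirming that the instance is actually serving again rather than relying on monit alone; for example, on the same VM:

# monit summary
# curl -k https://localhost:9024/healthz

Once /healthz responds, route_registrar should re-register the api route on its next health check cycle (every 20 seconds per the RegistrationInterval above).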
The permanent fix has been committed to CAPI 1.109.20, which will be included in TAS 2.11.41 and later.
TAS 2.13, 3.x, and 4.x already include the fix.