When pushing an application to the TAS foundation, staging would intermittently fail. The failure always occurred while uploading the droplet, as shown below.
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading droplet, build artifacts cache...
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading build artifacts cache...
2023-05-23T10:24:36.02+0800 [STG/0] OUT Uploading droplet...
2023-05-23T10:24:36.05+0800 [STG/0] OUT Uploaded build artifacts cache (129B)
2023-05-23T10:24:36.53+0800 [API/2] OUT Creating droplet for app with guid 69c6####-####-####-####-####9e3c
2023-05-23T10:24:38.09+0800 [API/0] OUT Creating droplet for app with guid 69c6####-####-####-####-####9e3c
2023-05-23T10:24:39.64+0800 [STG/0] ERR Failed to upload payload for droplet
2023-05-23T10:24:39.65+0800 [STG/0] ERR Uploading failed
2023-05-23T10:24:39.86+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba stopping instance 3a02####-####-####-####-####9be6
2023-05-23T10:24:39.86+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba destroying container for instance 3a02####-####-####-####-####9be6
2023-05-23T10:24:39.91+0800 [API/2] ERR Failed to stage build: staging failed
2023-05-23T10:24:40.14+0800 [STG/0] OUT Cell 7cd9####-####-####-####-####b6ba successfully destroyed container for instance 3a02####-####-####-####-####9be6
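For reference, the staging output above can be captured while reproducing the failure with the cf CLI; the app name below is a placeholder:

$ cf push <app-name>
$ cf logs <app-name> --recent | grep -E "STG|API"

The second command filters the recent log stream down to the staging (STG) and Cloud Controller (API) entries shown above.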
The rep job on the diego_cell instance reported the error "Upload failed: Status code 500".
{...,"level":"error","source":"rep","message":"rep.URLUploader.failed-uploading","data":{"attempt":1,"error":"Upload failed: Status code 500","fileLocation":"/var/vcap/data/rep/tmp/executor-work/compressed3308575130","session":"2"}}
Same error "Upload failed: Status code 500" was also seen cloud_controller logs
{...,"message":"diego.staging.completion-controller.staging-failed","log_level":"error","source":"cc.api","data":{"request_guid":"4175####-####-####-####-####909d","task_guid":"db7b297e-c5ed-4cd7-94e0-acca58ad9f46","failed":true,"failure_reason":"Upload failed: Status code 500","result":"","created_at":1684719607376265297},"thread_id":65260,"fiber_id":65280,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/16d38133417d769a5a8d280882646b0cd91e1ce5/cloud_controller_ng/app/controllers/internal/staging_completion_controller.rb","lineno":108,"method":"parse_bbs_task_callback"}
On this particular cloud_controller instance, the ccng_monit_http_healthcheck job had tried to restart the cloud_controller job because of repeated health check failures.
2023-05-17 16:08:58.766594523+00:00 Will restart CC over on repeated failures
2023-05-17 16:08:58.771876535+00:00 ccng_monit_http_healthcheck failed to curl https://###.###.###.###:9024/healthz: exit code 7
2023-05-17 16:08:58.772963058+00:00 :: Healthcheck failed consistently, restarting CC
2023-05-17 16:09:09.654327192+00:00 Will restart CC over on repeated failures
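curl exit code 7 means the connection could not be established at all. The same health check can be reproduced manually against the affected instance; the address below is a placeholder for the instance IP seen in the log:

$ curl -kv https://<api-instance-ip>:9024/healthz

A healthy Cloud Controller answers this request immediately; on the stuck instance it fails in the same way as the ccng_monit_http_healthcheck script.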
However, after the cloud_controller job was restarted, ccng_monit_http_healthcheck kept reporting that the cloud_controller.sock file was missing.
==> nginx-error.log
2023/05/17 16:09:09 [crit] 8#0: *1 connect() to unix:/var/vcap/data/cloud_controller_ng/cloud_controller.sock failed (2: No such file or directory) while connecting to upstream, client: ###.###.###.###, server: _, request: "GET /healthz HTTP/1.1", upstream: "http://unix:/var/vcap/data/cloud_controller_ng/cloud_controller.sock:/healthz", host: "###.###.###.###:9024"
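nginx proxies /healthz to the Cloud Controller through the unix socket named in the error, so a missing socket typically means the Ruby process never finished coming up. This can be checked on the affected VM, assuming the default data path:

# ls -l /var/vcap/data/cloud_controller_ng/cloud_controller.sock
# monit summary

Note that monit summary may still show cloud_controller_ng as running even though the socket was never created, which matches the stuck state described below.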
Also, after the restart the cloud_controller process produced no further logs about handling client requests, apart from a MySQL connection error.
{"timestamp":"2023-05-17T16:09:59.724590545Z","message":"2023-05-17T16:09:59+0000: [Worker(cc_api_worker.cloud_controller.3.2)] Error while reserving job: Mysql2::Error::ConnectionError: MySQL server has gone away","log_level":"info","source":"cc-worker","data":{},"thread_id":64420,"fiber_id":64480,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/16d38133417d769a5a8d280882646b0cd91e1ce5/gem_home/ruby/2.7.0/gems/delayed_job-4.1.8/lib/delayed/worker.rb","lineno":285,"method":"say"}
The route_registrar job also reported failures connecting to cloud_controller and unregistered the route.
==> route_registrar.stdout.log
{"timestamp":"2023-05-17T16:08:54.086332536Z","level":"info","source":"Route Registrar","message":"Route Registrar.Script exited with error","data":{"error":"exit status 7","script":"/var/vcap/jobs/c
loud_controller_ng/bin/cloud_controller_ng_health_check","stderr":" % Total % Received % Xferd Average Speed Time Time Time Current\n Dload Upload To
tal Spent Left Speed\n\r 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 9024: Connection refused\n","stdout":"2023-05
-17 16:08:54.075543263+00:00 cloud_controller_ng_health_check: route-registrar about to curl https://localhost:9024/healthz\n2023-05-17 16:08:54.085880084+00:00 Failed to hit https://localhost:9024/h
ealthz\n"}}
{"timestamp":"2023-05-17T16:08:54.086377049Z","level":"info","source":"Route Registrar","message":"Route Registrar.healthchecker returned unhealthy for route","data":{"route":{"Type":"","Name":"api",
"Port":null,"TLSPort":9024,"Tags":{"component":"CloudController"},"URIs":["api.sys.example.com"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval
":20000000000,"HealthCheck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-contro
ller-ng.service.cf.internal"}}}
{"timestamp":"2023-05-17T16:08:54.086397738Z","level":"info","source":"Route Registrar","message":"Route Registrar.Unregistering route","data":{"route":{"Type":"","Name":"api","Port":null,"TLSPort":9
024,"Tags":{"component":"CloudController"},"URIs":["api.sys.example.com"],"RouterGroup":"","Host":"","ExternalPort":null,"RouteServiceUrl":"","RegistrationInterval":20000000000,"HealthCh
eck":{"Name":"api-health-check","ScriptPath":"/var/vcap/jobs/cloud_controller_ng/bin/cloud_controller_ng_health_check","Timeout":6000000000},"ServerCertDomainSAN":"cloud-controller-ng.service.cf.inte
rnal"}}}
All of these symptoms indicated that this particular cloud_controller instance had become stuck at some point after the restart and was rejecting every incoming request. As a result, this cloud_controller returned a 500 status code when the rep tried to upload the droplet payload to it.
Product Version: 2.11
Diego reports staging status to cloud_controller by calling its API over the internal route, which is resolved by BOSH DNS. While route_registrar had unregistered the external api route, the failing cloud_controller instance was never removed from this internal routing, so Diego kept sending requests to it and the intermittent staging failures occurred.
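The internal route is the BOSH DNS name seen in the route_registrar output above (cloud-controller-ng.service.cf.internal). Whether the stuck instance is still being handed out can be checked from a diego_cell VM; a minimal sketch, assuming dig is available on the cell:

# dig +short cloud-controller-ng.service.cf.internal

If the IP of the stuck api instance still appears in the answer, Diego keeps sending droplet uploads and staging callbacks to it, which is the failure mode described here.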
The temporary workaround is to restart the cloud_controller_ng job on the problematic instance.
# monit restart cloud_controller_ng
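After the restart, it is worth confirming that the instance is actually serving again rather than relying on monit alone; for example, on the same VM:

# monit summary
# curl -k https://localhost:9024/healthz

Once /healthz responds, route_registrar should re-register the api route on its next health check cycle (every 20 seconds per the RegistrationInterval above).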
The permanent fix has been committed to CAPI 1.109.20, which will be included in TAS 2.11.41 and later.
TAS 2.13, 3.x, and 4.x already include the fix.