Cloud Controller job_queue_length is increasing and cloud controller worker logs show CF-JobTimeout errors
search cancel

Cloud Controller job_queue_length is increasing and cloud controller worker logs show CF-JobTimeout errors

book

Article ID: 433325

calendar_today

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

 

You might observe timeout errors when job_queue_length metric is increasing.

{"timestamp":"2026-03-12T16:42:36.760496288Z","message":"Request failed: 524: {\"description\"=\u003e\"The job execution has timed out.\", \"error_code\"=\u003e\"CF-JobTimeout\", \"code\"=\u003e290006, \"test_mode_info\"=\u003e{\"description\"=\u003e\"The job execution has timed out.\", \"error_code\"=\u003e\"CF-JobTimeout\", \"backtrace\"=\u003e[\"/var/vcap/data/packages/cloud_controller_ng/GUID/cloud_controller_ng/app/jobs/timeout_job.rb:18:in `rescue in perform'\"

 

The example below shows cloud controller worker starting a BlobstoreDelete job and it eventually fails 4 hours later.  The default job timeout for cloud controller worker is 4 hours. 

{"timestamp":"2026-02-27T05:36:23.843405396Z","message":"2026-02-27T05:36:23+0000: [Worker(cc_global_worker.cloud_controller_worker.2.1)] Job VCAP::CloudController::Jobs::Runtime::BlobstoreDelete (id=72451960) (queue=cc-generic) RUNNING","log_level":"info","source":"cc-worker","data":{"worker_name":"cc_global_worker.cloud_controller_worker.2.1"},"thread_id":14340,"fiber_id":14360,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/GUID/gem_home/ruby/3.2.0/gems/delayed_job-4.1.13/lib/delayed/worker.rb","lineno":286,"method":"say"}

{"timestamp":"2026-02-27T09:36:23.942161756Z","message":"2026-02-27T09:36:23+0000: [Worker(cc_global_worker.cloud_controller_worker.2.1)] Job VCAP::CloudController::Jobs::Runtime::BlobstoreDelete (id=72451960) (queue=cc-generic) FAILED (0 prior attempts) with CloudController::Errors::ApiError: The job execution has timed out.","log_level":"error","source":"cc-worker","data":{"worker_name":"cc_global_worker.cloud_controller_worker.2.1"},"thread_id":14340,"fiber_id":14360,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/GUID/gem_home/ruby/3.2.0/gems/delayed_job-4.1.13/lib/delayed/worker.rb","lineno":286,"method":"say"}

 

after 4 hours the job will be retried and in this case was successful

{"timestamp":"2026-02-27T09:36:29.065131358Z","message":"2026-02-27T09:36:29+0000: [Worker(cc_global_worker.cloud_controller_worker.1.1)] Job VCAP::CloudController::Jobs::Runtime::BlobstoreDelete (id=72451960) (queue=cc-generic) RUNNING","log_level":"info","source":"cc-worker","data":{"worker_name":"cc_global_worker.cloud_controller_worker.1.1"},"thread_id":14340,"fiber_id":14360,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/GUID/gem_home/ruby/3.2.0/gems/delayed_job-4.1.13/lib/delayed/worker.rb","lineno":286,"method":"say"}

{"timestamp":"2026-02-27T09:36:29.268478202Z","message":"2026-02-27T09:36:29+0000: [Worker(cc_global_worker.cloud_controller_worker.1.1)] Job VCAP::CloudController::Jobs::Runtime::BlobstoreDelete (id=72451960) (queue=cc-generic) COMPLETED after 0.2030","log_level":"info","source":"cc-worker","data":{"worker_name":"cc_global_worker.cloud_controller_worker.1.1"},"thread_id":14340,"fiber_id":14360,"process_id":6,"file":"/var/vcap/data/packages/cloud_controller_ng/GUID/gem_home/ruby/3.2.0/gems/delayed_job-4.1.13/lib/delayed/worker.rb","lineno":286,"method":"say"}

 

Environment

Third Party s3 blobstore

Cause

cloud controller worker uses the fog-aws client which does not enable TCP keepalives when creating a socket with the external s3 blobstore.  This problem occurs when the cloud controller worker successfully opens a TCP socket with s3 and sends a HTTP request.  While waiting for the HTTP response the external s3 blobstore drops the TCP socket without closing.  Given the cloud controller worker does not know the TCP socket is closed it will wait for 4 hours until the job timeout is reached. 

You will observe cloud controller worker has a healthy long running TCP session with "ss -ntpo" command. note there are not timers running on this socket.

ESTAB 0      0        CC-WORKER:48722 S3-BLOBSTORE:443   users:(("ruby",pid=509646,fd=26))

 

This is an example of a tcpsocket with keepalive timer set when cloud controller worker communicates with mysql database.  

ESTAB 0      0        CC-WORKER:33966  MYSQL:3306  users:(("ruby",pid=532000,fd=7)) timer:(keepalive,22sec,0) 

 

Resolution

In a future release of EAR the cloud controller worker will set a keepalive timer on sockets used to connect to external s3 blobstore.  This will allow the cloud controller worker to fail and retry faster when there is an external problem.