Orphaned application container due to Diego sync failure on Clock Global VM



Article ID: 297761


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

On the clock_global virtual machine (VM), the uaa_ca.crt file under /var/vcap/jobs/cloud_controller_clock/config/certs/ is empty.

The file size is visible in the following ls output, where uaa_ca.crt is only 2 bytes:

/var/vcap/jobs/cloud_controller_clock/config/certs# ls -la
total 32
drwxr-x--- 2 root vcap 4096 Jan 14 18:11 .
drwxr-x--- 3 root vcap 4096 Jan 14 18:11 ..
-rw-r----- 1 root vcap 1209 Jan 14 18:07 credhub_ca.crt
-rw-r----- 1 root vcap    1 Jan 14 18:07 db_ca.crt
-rw-r----- 1 root vcap 1209 Jan 14 18:07 mutual_tls_ca.crt
-rw-r----- 1 root vcap 1327 Jan 14 18:07 mutual_tls.crt
-rw-r----- 1 root vcap 1680 Jan 14 18:07 mutual_tls.key
-rw-r----- 1 root vcap    2 Jan 14 18:07 uaa_ca.crt
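The same check can be scripted instead of read off the listing. The following is a minimal sketch, not a supported tool: it assumes Python 3 is available on the VM and that the certificates are PEM encoded, and the function name find_placeholder_certs is hypothetical. It reports every .crt file in a directory that contains no PEM certificate block.

```python
from pathlib import Path

# Assumption: the certificates in this directory are PEM encoded.
PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def find_placeholder_certs(cert_dir):
    """Return the sorted names of *.crt files in cert_dir with no PEM certificate block."""
    suspects = []
    for path in sorted(Path(cert_dir).glob("*.crt")):
        # A file holding only whitespace or newlines has no BEGIN marker.
        if PEM_MARKER not in path.read_bytes():
            suspects.append(path.name)
    return suspects


if __name__ == "__main__":
    # Directory taken from the listing above; run as a user allowed to read it.
    cert_dir = "/var/vcap/jobs/cloud_controller_clock/config/certs"
    for name in find_placeholder_certs(cert_dir):
        print("no certificate found in", name)
```

Against a directory like the one listed above, this would presumably flag both db_ca.crt (1 byte) and uaa_ca.crt (2 bytes); this article concerns only the latter.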

As a result, the cloud_controller_clock job cannot retrieve a User Account and Authentication (UAA) token, and the cc.diego.sync.processes sync eventually fails as well, logging an error such as:

{"timestamp":1548072501.0820189,"message":"error-updating-lrp-state","log_level":"error","source":"cc.diego.sync.processes","data":{"error":"OpenSSL::X509::StoreError","error_message":""},"thread_id":47217024603260,"fiber_id":47217024554400,"process_id":9615,"file":"/var/vcap/data/packages/cloud_controller_ng/1e4b5398d290f36d4f16bf9d7eaea36362084be2/cloud_controller_ng/lib/cloud_controller/diego/processes_sync.rb","lineno":89,"method":"block in process_workpool_exceptions"}

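To find records like the one above in a large log, the JSON fields can be matched programmatically. This is a rough sketch, assuming each log line is a single JSON object as in the sample. The function name is_sync_cert_failure is hypothetical, and the sample string is a shortened copy of the record above; feed the function lines from the clock job's log.

```python
import json


def is_sync_cert_failure(log_line):
    """Return True if log_line is a Diego process-sync error caused by an X509 store error."""
    try:
        record = json.loads(log_line)
    except ValueError:
        # Not a JSON line (json.JSONDecodeError is a subclass of ValueError).
        return False
    if not isinstance(record, dict):
        return False
    data = record.get("data")
    return (
        record.get("source") == "cc.diego.sync.processes"
        and record.get("message") == "error-updating-lrp-state"
        and isinstance(data, dict)
        and data.get("error") == "OpenSSL::X509::StoreError"
    )


# Shortened copy of the record shown above, used only as a demonstration input.
sample = (
    '{"message":"error-updating-lrp-state","log_level":"error",'
    '"source":"cc.diego.sync.processes",'
    '"data":{"error":"OpenSSL::X509::StoreError","error_message":""}}'
)
print(is_sync_cert_failure(sample))  # prints True
```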
As a result, after cf delete or cf push is run against an application, the old application container may still exist on the Diego Cell, and its route information is still registered with the Gorouter through the route-emitter.

Gorouters can then route some requests to the old containers, which leads to unexpected behavior.

Environment


Cause

When Transmission Control Protocol (TCP) routes are not used, the Cloud Controller API (CAPI) sync job, which runs in the cloud_controller_clock job, talks to the Diego database directly, so the UAA CA certificate is not required.


However, when TCP routes are used, the sync job needs to talk to the Routing API to determine whether a request needs TCP routing.

This is because the cloud_controller_clock job uses the same network library as the cloud_controller job, and the code path for the Routing API is the same.

As a result, when TCP routes are used, all requests, whether internal or external, HTTP or TCP, go through the Routing API for the necessary checks.


The cloud_controller_clock job talks to the Routing API using a token granted by UAA, so it needs the correct certificate authority (CA) certificate for UAA.

The problem is that, in the default release, the cloud_controller_ng job has the necessary UAA CA, while the cloud_controller_clock job does not.

Resolution

A temporary workaround is to copy the BOSH root CA certificate into the clock_global VM.

The source location is:

Operations Manager VM - /var/tempest/workspaces/default/root_ca_certificate

The target location is:

Clock Global VM - /var/vcap/jobs/cloud_controller_clock/config/certs/
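The copy can be guarded so that an empty or truncated file is never put in place. The sketch below rests on several assumptions that the article does not state: the root CA file has already been transferred to the clock_global VM by some other means, it is PEM encoded, and it is meant to replace the empty uaa_ca.crt. The function name install_ca is hypothetical, and the 0640 mode simply mirrors the -rw-r----- permissions in the listing under Symptoms.

```python
import os
import shutil

# Assumption: the root CA certificate is PEM encoded.
PEM_MARKER = b"-----BEGIN CERTIFICATE-----"


def install_ca(staged_path, target_path):
    """Copy staged_path over target_path only if it contains a PEM certificate block."""
    with open(staged_path, "rb") as handle:
        if PEM_MARKER not in handle.read():
            raise ValueError(f"{staged_path} does not contain a PEM certificate")
    # copyfile rewrites the contents in place, so an existing target keeps its owner and group.
    shutil.copyfile(staged_path, target_path)
    # Mirror the -rw-r----- mode of the other files in the certs directory.
    os.chmod(target_path, 0o640)
```

A call would pass the staged copy of root_ca_certificate as the first argument and the chosen file under /var/vcap/jobs/cloud_controller_clock/config/certs/ as the second, run as a user with write access to that directory.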

The permanent fix will be released in PAS 2.2.12.