This guide will describe how to trace a failed droplet upload using the distributed tracing feature and an example demonstrating how to identify the trace_id and span_id in logs files. For context in this example we manually corrupted the nfs-blobstore nginx configuration so that it blocks PUT requests from the cloud controller. This will allow Cloud Controller to download droplets but not allow it to upload them.
NOTE: The trace_id and span_id will not be found in all component logs. This feature specifically targets create and update flows between Cloud Controller and Diego components. Also UAA catalina logs will now print trace ids in TAS 3.0.16, 4.0.8, 5.0.0.
These are the log bundles and command outputs required to fully take advantage of the Control Plane Distributed Tracing feature.
Note: It is impractical to collect all diego cell logs for cf push failure events. You can easily find which diego cell was used by searching the app logs as per the following example.
cf logs mytestapp --recent | egrep -e "\[.*?\] OUT Cell" | awk '{print "bosh logs cell/"$5}' | sort -u bosh logs cell/85a365d2-c0a4-4582-86f7-e75a124d93b4
Combing through logs can be a tedious exercise and syslog aggregators can help a lot. If you have access to something like Tanzu Insights which consolidate all the platform logs into one single source then there is only one string you need to search for and that is the trace-id of the staging request. An example of a trace ID string is “e3ae77c2fc554e4f54e907f30f299586”
cf cli verbose output shows the trace id as
X-B3-Traceid: e3ae77c2fc554e4f54e907f30f299586
Gorouter access log shows the trace id as
x_b3_traceid:"e3ae77c2fc554e4f54e907f30f299586"
Cloud Controller shows the trace id as
"b3_trace_id":"e3ae77c2fc554e4f54e907f30f299586”
Diego Components shows the trace id as
"b3_trace_id":"e3ae77c2fc554e4f54e907f30f299586"
UAA Catalina log
"traceId":"602938bd2f44431658290abdefddc33f"
Lets test the distributed tracing feature with the following context that we discussed so far where droplet download works, but upload is broken. Given we already know droplet upload is broken we chose an existing application to restage in order to simulate a failure where a diego cell fails at some point along the path of a droplet upload.
We start by restaging an existing application and observe the following error ( with debug output removed for brevity )
cf -v restage simple-http Error staging application: StagingError - Staging error: staging failed FAILED
In the cf cli debug output we would find a /v3/builds POST request that triggers the staging task
REQUEST: [2024-02-09T16:13:39Z] POST /v3/builds HTTP/1.1 Host: api.run-05.slot-##.#####-###-####.#####.com Accept: application/json Authorization: [PRIVATE DATA HIDDEN] Content-Type: application/json User-Agent: cf/8.3.0+e6f8a85.2022-03-11 (go1.17.7; amd64 linux) { "package": { "guid": "ffb228db-7be3-43c3-a0e3-5a6eb8aeff68" } }
cf cli will continue to send other requests to cloud controller while it waits for the POST request to return. We should look for 201 Created response returned back at some point later. Even though staging failed, this request is handled asynchronously which allows the cloud controller to respond with HTTP 201 while it works to inform Diego of the staging request. In this response we will find the the X-B3-Traceid “42edbf9921b4466f7b6d6caaafb27495”
RESPONSE: [2024-02-09T16:13:39Z]
HTTP/1.1 201 Created
Content-Type: application/json; charset=utf-8
Date: Fri, 09 Feb 2024 16:13:39 GMT
Referrer-Policy: strict-origin-when-cross-origin
Server: nginx
Vary: Accept
X-B3-Spanid: 7b6d6caaafb27495
X-B3-Traceid: 42edbf9921b4466f7b6d6caaafb27495
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Frame-Options: SAMEORIGIN
X-Permitted-Cross-Domain-Policies: none
X-Runtime: 0.325009
X-Vcap-Request-Id: 42edbf99-21b4-466f-7b6d-6caaafb27495::85343b4a-ee94-4a10-8a29-c36604365924
X-Xss-Protection: 1; mode=block
.
.
.
"package": {
"guid": "ffb228db-7be3-43c3-a0e3-5a6eb8aeff68"
},
.
.
.
},
"staging_disk_in_mb": 8192,
"staging_log_rate_limit_bytes_per_second": 16384,
"staging_memory_in_mb": 1024,
"state": "STAGING",
"updated_at": "2024-02-09T16:13:39Z"
We can take the trace id and search a log aggregation tool to find all the relevant logs associated with this staging request. In this example we will download all the log bundles mentioned above, extract them into a folder and scan them for the trace id.
Note: Do not output build-request.log in the same folder you are scanning. This could result in filling your hard drive or duplicate logs found should find command scan the build-request.log.
find ./extracted/ -type f | xargs egrep --no-filename 42edbf9921b4466f7b6d6caaafb27495 > /tmp/build-request.log
Then we can read the logs in chronological order
cat /tmp/build-request.log | egrep “^{“ | jq | jq -cs 'sort_by(.timestamp) | .[]'
In this example we want to look for the staging request GUID. This represents the staging task ID that will be referenced in all the logs for this task.
{"timestamp":"2024-02-09T16:13:39.368017978Z","message":"stage.request","log_level":"info","source":"cc.bbs.stager_client","data":{"request_guid":"42edbf99-21b4-466f-7b6d-6caaafb27495::85343b4a-ee94-4a10-8a29-c36604365924","user_guid":"cd2f7ac2-6b58-4120-afae-7bdb24747d66","b3_trace_id":"42edbf9921b4466f7b6d6caaafb27495","b3_span_id":"7b6d6caaafb27495","staging_guid":"336c48f5-cff8-4c53-9fc1-c272c5b0e663"},"thread_id":58300,"fiber_id":58320,"process_id":7,"file":"/var/vcap/data/packages/cloud_controller_ng/3734a3dbc303a5048761d744d7fda53729e26013/cloud_controller_ng/lib/cloud_controller/diego/bbs_stager_client.rb","lineno":17,"method":"stage"}
With the staging GUID we can scan the logs bundles for why this task failed
find ./extracted/ -type f | xargs egrep --no-filename 336c48f5-cff8-4c53-9fc1-c272c5b0e663 > /tmp/staging.log
Check the staging logs for errors and we can quickly find the cloud controller received 500 http status code from the blobstore. At this point in the investigation we know cloud controller is not able to communicate with blobstore and we can start isolating the root problem.
cat /tmp/staging.log | egrep "^{" | jq | jq -cs 'sort_by(.timestamp) | .[]' | grep "log_level\":\"error"
{"timestamp":"2024-02-09T16:14:20.681198698Z","message":"diego.staging.completion-controller.staging-failed","log_level":"error","source":"cc.api","data":{"request_guid":"cf5e6ed9-dca2-491c-b28b-044c858dfc2c","task_guid":"336c48f5-cff8-4c53-9fc1-c272c5b0e663","failed":true,"failure_reason":"Upload failed: Status code 500","result":"","created_at":1707495219542155000},"thread_id":55800,"fiber_id":55820,"process_id":7,"file":"/var/vcap/data/packages/cloud_controller_ng/3734a3dbc303a5048761d744d7fda53729e26013/cloud_controller_ng/app/controllers/internal/staging_completion_controller.rb","lineno":94,"method":"parse_bbs_task_callback"}