How to utilize the new TAS control plane distributed tracing feature for troubleshooting

Article ID: 297482


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Distributed tracing has been available for applications in the gorouter via the Zipkin Tracing feature for some time, but not for the TAS control plane, and debugging Tanzu Application Service control plane components such as the Cloud Controller and Diego can be difficult. The distributed tracing feature introduced in TAS 2.11.42, 2.13.24, 3.0.14, 4.0.5, and 5.0.0 allows control plane requests to be traced. Before this feature, tracing something like a cf push failure through the component logs relied on the VCAP-ID, which only lets you follow a request from the gorouter log to the Cloud Controller instance that serviced it. Tracing the work the Cloud Controller generates afterwards is harder. For example, if a cf push fails because of a droplet upload, you need to review logs from every component in the "droplet upload" workflow shown in the diagram below. This is a complex workflow, and stitching together timestamps to correlate the request flow is inefficient.


Diagram description: After the Diego cell completes the staging process, it uploads the generated application container droplet to the file-server process on the Diego brain. The file-server process then forwards the droplet to the Cloud Controller, which uploads it to the nfs-blobstore (or the configured S3 object store).

Resolution

This guide describes how to trace a failed droplet upload using the distributed tracing feature, with an example showing how to identify the trace_id and span_id in log files. For context, in this example we manually corrupted the nfs-blobstore nginx configuration so that it blocks PUT requests from the Cloud Controller. This allows the Cloud Controller to download droplets but not upload them.
 

NOTE: The trace_id and span_id will not be found in all component logs. This feature specifically targets create and update flows between the Cloud Controller and Diego components. In addition, UAA catalina logs print trace IDs as of TAS 3.0.16, 4.0.8, and 5.0.0.

What Logs do I need and where is the Trace ID?

These are the log bundles and command outputs required to take full advantage of the Control Plane Distributed Tracing feature (a collection sketch follows the list).

  • cf cli verbose output “-v”
  • cf app logs
  • Gorouter log bundle
  • Cloud Controller log bundle
  • Cloud Controller Worker log bundle
  • Diego Database log bundle
  • Diego Brain log bundle
  • Diego Cell log bundle
  • UAA log bundle
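
Each bundle can be collected with bosh logs against the corresponding instance group. A minimal collection sketch follows; the deployment name cf-xxxxx and the instance group names are examples and vary between TAS versions and foundations, so confirm them with bosh instances first.

# confirm the instance group names in your foundation; they vary by TAS version
bosh -d cf-xxxxx instances

# collect one log bundle per control plane instance group involved in staging
for group in router cloud_controller cloud_controller_worker diego_database diego_brain uaa; do
  bosh -d cf-xxxxx logs "$group"
done

# for the Diego cell, collect only the cell that staged the app (see the note below)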

 

Note: It is impractical to collect logs from every Diego cell for a cf push failure. You can easily find which Diego cell was used by searching the app logs, as in the following example.

cf logs mytestapp --recent | egrep -e "\[.*?\] OUT Cell" | awk '{print "bosh logs cell/"$5}' | sort -u

bosh logs cell/85a365d2-c0a4-4582-86f7-e75a124d93b4


Combing through logs can be a tedious exercise, and syslog aggregators help a lot. If you have access to something like Tanzu Insights, which consolidates all the platform logs into a single source, then there is only one string you need to search for: the trace ID of the staging request. An example of a trace ID string is "e3ae77c2fc554e4f54e907f30f299586".

cf cli verbose output shows the trace id as

X-B3-Traceid: e3ae77c2fc554e4f54e907f30f299586

Gorouter access log shows the trace id as

x_b3_traceid:"e3ae77c2fc554e4f54e907f30f299586"

Cloud Controller shows the trace id as

"b3_trace_id":"e3ae77c2fc554e4f54e907f30f299586”

Diego components show the trace id as

"b3_trace_id":"e3ae77c2fc554e4f54e907f30f299586"

UAA Catalina log shows the trace id as

"traceId":"602938bd2f44431658290abdefddc33f"

Droplet upload failed example

Let's test the distributed tracing feature in the context described so far, where droplet download works but droplet upload is broken. Since we already know the upload is broken, we restage an existing application to simulate a failure in which a Diego cell fails at some point along the droplet upload path.

We start by restaging an existing application and observe the following error (debug output removed for brevity):

cf -v restage simple-http
 
Error staging application: StagingError - Staging error: staging failed
FAILED
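
Because the verbose output is long, it helps to capture it to a file when running the restage so it can be searched later. A minimal sketch (the filename cf-restage-verbose.log is only an example):

cf -v restage simple-http 2>&1 | tee cf-restage-verbose.log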

In the cf CLI debug output we find the POST request to /v3/builds that triggers the staging task:

REQUEST: [2024-02-09T16:13:39Z]
POST /v3/builds HTTP/1.1
Host: api.run-05.slot-##.#####-###-####.#####.com
Accept: application/json
Authorization: [PRIVATE DATA HIDDEN]
Content-Type: application/json
User-Agent: cf/8.3.0+e6f8a85.2022-03-11 (go1.17.7; amd64 linux)
{
  "package": {
    "guid": "ffb228db-7be3-43c3-a0e3-5a6eb8aeff68"
  }
}

The cf CLI will continue to send other requests to the Cloud Controller while it waits for the POST request to return, so look for the 201 Created response somewhat later in the output. Even though staging ultimately failed, this request is handled asynchronously, which allows the Cloud Controller to respond with HTTP 201 while it works to inform Diego of the staging request. In this response we find the X-B3-Traceid "42edbf9921b4466f7b6d6caaafb27495".

RESPONSE: [2024-02-09T16:13:39Z]
HTTP/1.1 201 Created
Content-Type: application/json; charset=utf-8
Date: Fri, 09 Feb 2024 16:13:39 GMT
Referrer-Policy: strict-origin-when-cross-origin
Server: nginx
Vary: Accept
X-B3-Spanid: 7b6d6caaafb27495
X-B3-Traceid: 42edbf9921b4466f7b6d6caaafb27495
X-Content-Type-Options: nosniff
X-Download-Options: noopen
X-Frame-Options: SAMEORIGIN
X-Permitted-Cross-Domain-Policies: none
X-Runtime: 0.325009
X-Vcap-Request-Id: 42edbf99-21b4-466f-7b6d-6caaafb27495::85343b4a-ee94-4a10-8a29-c36604365924
X-Xss-Protection: 1; mode=block

.
.
.
  "package": {
    "guid": "ffb228db-7be3-43c3-a0e3-5a6eb8aeff68"
  },
.
.
.
  },
  "staging_disk_in_mb": 8192,
  "staging_log_rate_limit_bytes_per_second": 16384,
  "staging_memory_in_mb": 1024,
  "state": "STAGING",
  "updated_at": "2024-02-09T16:13:39Z"

We can take the trace ID and search a log aggregation tool to find all the relevant logs associated with this staging request. In this example we download all the log bundles mentioned above, extract them into a folder, and scan them for the trace ID.
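
The exact layout of a bosh logs bundle can vary between bosh CLI versions, but it is typically a .tgz that itself contains per-job .tgz files, so a nested extraction is needed. A rough sketch, assuming the downloaded bundles are in the current directory:

# extract each downloaded bundle into ./extracted/, then unpack any nested per-job tarballs
mkdir -p extracted
for bundle in *.tgz; do tar -xzf "$bundle" -C extracted; done
find ./extracted/ -name '*.tgz' -execdir tar -xzf '{}' \;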

Note: Do not write build-request.log into the folder you are scanning. Doing so could fill your disk, or produce duplicate log lines if the find command picks up build-request.log itself.

find ./extracted/ -type f | xargs egrep --no-filename 42edbf9921b4466f7b6d6caaafb27495 > /tmp/build-request.log
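
A variant with grep -l simply lists which log files mention the trace ID, which is a quick way to see every component that handled the request before reading the combined output:

grep -rIl 42edbf9921b4466f7b6d6caaafb27495 ./extracted/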

Then we can read the logs in chronological order

cat /tmp/build-request.log | egrep "^{" | jq -cs 'sort_by(.timestamp) | .[]'

In this example we want to look for the staging request GUID.  This represents the staging task ID that will be referenced in all the logs for this task.

{"timestamp":"2024-02-09T16:13:39.368017978Z","message":"stage.request","log_level":"info","source":"cc.bbs.stager_client","data":{"request_guid":"42edbf99-21b4-466f-7b6d-6caaafb27495::85343b4a-ee94-4a10-8a29-c36604365924","user_guid":"cd2f7ac2-6b58-4120-afae-7bdb24747d66","b3_trace_id":"42edbf9921b4466f7b6d6caaafb27495","b3_span_id":"7b6d6caaafb27495","staging_guid":"336c48f5-cff8-4c53-9fc1-c272c5b0e663"},"thread_id":58300,"fiber_id":58320,"process_id":7,"file":"/var/vcap/data/packages/cloud_controller_ng/3734a3dbc303a5048761d744d7fda53729e26013/cloud_controller_ng/lib/cloud_controller/diego/bbs_stager_client.rb","lineno":17,"method":"stage"}

With the staging GUID we can scan the log bundles to find out why this task failed:

find ./extracted/ -type f | xargs egrep --no-filename 336c48f5-cff8-4c53-9fc1-c272c5b0e663 > /tmp/staging.log

Checking the staging logs for errors, we quickly find that the Cloud Controller received an HTTP 500 status code from the blobstore. At this point in the investigation we know the Cloud Controller cannot communicate with the blobstore, and we can start isolating the root problem.

cat /tmp/staging.log | egrep "^{" | jq -cs 'sort_by(.timestamp) | .[]' | grep '"log_level":"error"'

{"timestamp":"2024-02-09T16:14:20.681198698Z","message":"diego.staging.completion-controller.staging-failed","log_level":"error","source":"cc.api","data":{"request_guid":"cf5e6ed9-dca2-491c-b28b-044c858dfc2c","task_guid":"336c48f5-cff8-4c53-9fc1-c272c5b0e663","failed":true,"failure_reason":"Upload failed: Status code 500","result":"","created_at":1707495219542155000},"thread_id":55800,"fiber_id":55820,"process_id":7,"file":"/var/vcap/data/packages/cloud_controller_ng/3734a3dbc303a5048761d744d7fda53729e26013/cloud_controller_ng/app/controllers/internal/staging_completion_controller.rb","lineno":94,"method":"parse_bbs_task_callback"}