How to correlate the subscription id with the nozzle

Article ID: 298276

Updated On:

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Log loss can often be attributed to slow consumers. This KB explains how you can identify a slow consumer/nozzle using the subscription id.

Environment

Product Version: 2.11

Resolution

First, let us understand what a Subscription ID is:

Subscription ID: Also known as the Shard ID. This is typically set by the person configuring the nozzle when it is set up. If different teams are responsible for deploying the nozzles, they should be able to share the subscription id that they have configured. For example, the Filebeat Cloud Foundry input's subscription id defaults to a UUID but is configurable.
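
As a hedged illustration, the configured value can often be confirmed directly on the host running the nozzle. The filebeat.yml location and the example values below are assumptions based on a typical Filebeat setup, not taken from this KB:

# Check which subscription/shard id a Filebeat cloudfoundry input is using
# (file location and values shown are illustrative assumptions)
$ sudo grep -A5 "type: cloudfoundry" /etc/filebeat/filebeat.yml
- type: cloudfoundry
  api_address: https://api.sys.example.com
  client_id: filebeat-nozzle
  client_secret: ${CF_CLIENT_SECRET}
  shard_id: my-team-filebeat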

If you are unable to link the subscription id to the team responsible for the nozzle, you may still be able to identify the team based on the remote ip.

Tracing for v1 Nozzles

Doppler drops

Doppler nodes will log the subscription id in doppler.stderr.log against drops that occur:

2022-12-22T20:34:57.244097852Z Dropped 1000 envelopes (v1 buffer) ShardID: my-shard-id
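
To see which shard ids are being dropped across all Doppler instances, a grep in the style of the traffic controller examples later in this KB should work. The /var/vcap/sys/log/doppler/ directory is an assumption based on the standard BOSH log layout:

# Search every doppler instance for v1 buffer drops and the shard id they are attributed to
$ bosh ssh doppler 'sudo grep "ShardID" /var/vcap/sys/log/doppler/doppler.stderr.log' | grep stdout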

Slow Consumers

For a v1 nozzle, the traffic controller will both emit a metric and write a log entry when a slow consumer is detected. Both will contain a subscription id and a remote ip.

Slow Consumer Metric

Here's an example showing a slow consumer metric that includes both the subscription id ('my-shard-id') and the real remote ip (the X-Forwarded-For header value):

$ cf install-plugin log-cache
$ cf tail traffic_controller --envelope-type event --json | grep 'slow consumer' | jq -r '.batch | last | .event.body'
Remote Address: 10.0.4.15:33498
X-Forwarded-For: 203.0.113.10
Path: /firehose/my-shard-id?

When v1 Loggregator detects a slow connection, that connection is disconnected to prevent back pressure on the system. This may be due to improperly scaled nozzles, or slow user connections to v1 Loggregator. 
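
If you prefer to watch for these disconnects over time rather than inspect individual events, a sketch along these lines may work. The counter name doppler_proxy.slow_consumer is an assumption; verify the exact name in your foundation's metrics:

# Count slow consumer disconnects reported by the traffic controller
# (the counter name doppler_proxy.slow_consumer is an assumption)
$ cf tail traffic_controller --envelope-type counter --json | jq -r '.batch[] | select(.counter.name == "doppler_proxy.slow_consumer")'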

Slow Consumer Logs

You can alternatively see the subscription id and remote ip in the traffic controller logs. For example:

$ bosh ssh loggregator_trafficcontroller 'sudo grep "Slow Consumer" /var/vcap/sys/log/loggregator_trafficcontroller/loggregator_trafficcontroller.stderr.log' | grep stdout
loggregator_trafficcontroller/b9481558-468a-4aa3-ba38-3b56164be9ac: stdout | 2022-12-22T19:05:27.460094701Z Doppler Proxy: Slow Consumer from 10.0.4.15:33498 using /firehose/my-shard-id?

The remote IP in this log (10.0.4.15 above) isn't particularly useful because it's normally the IP address of the gorouter. However, you can search for the corresponding entry in the gorouter access log to determine the real remote ip.

$ bosh ssh router 'grep my-shard-id /var/vcap/sys/log/gorouter/access.log' | grep stdout
router/7db6747a-8644-4d20-bb7a-f156f93b3796: stdout | doppler.sys.example.com:443 - [2022-12-22T19:04:48.106031549Z] "GET /firehose/my-shard-id? HTTP/1.1" 101 0 0 "-" "Go-http-client/1.1" "203.0.113.10:51932" "10.0.4.19:8081" x_forwarded_for:"203.0.113.10" x_forwarded_proto:"https" vcap_request_id:"67b10922-1a2f-4478-5679-db8f3aecada7" response_time:158.438550 gorouter_time:158.438550 app_id:"-" app_index:"-" instance_id:"371a03ca-2985-4933-62df-392ec3705721" x_cf_routererror:"-" x_b3_traceid:"7c9b5196a2545c820194948ee486a28f" x_b3_spanid:"0194948ee486a28f" x_b3_parentspanid:"-" b3:"7c9b5196a2545c820194948ee486a28f-0194948ee486a28f"

An entry will only be visible in the gorouter access log when the websocket connection from the nozzle to the gorouter has been closed.

Seeing a slow consumer metric/log is an indication that the nozzle is very unresponsive and/or scaling is necessary.

Tracing for v2 Nozzles

v2 Loggregator took a different approach to handling slow consumers. v2 nozzles can connect directly to the reverse log proxy (rlp) via gRPC, or to the reverse log proxy gateway (rlpg) over long-lived http connections. Instead of disconnecting the slow consumer, v2 Loggregator drops at the rlp buffer, per subscription id, when that nozzle is not ingesting as fast as Loggregator is dispensing. This means we are more likely to see rlp drops instead of doppler drops for slower v2 consumers, as the rlp is designed to continuously ingest even if the downstream is slow.

If the nozzle is an external v2 consumer, it will leverage the rlpg. The v1 method above should apply similarly, as the rlpg logs the subscription id (shard_id) of its consumers in the request path, along with the gorouter IP. For example, this is a log from the file reverse_log_proxy_gateway.stdout.log:

10.225.61.37 - - [17/Jan/2023:15:42:15 +0000] "GET /v2/read?shard_id=external-v2-nozzle-sub-id&log&counter&event&gauge&timer HTTP/1.1" 200 4113754


In the above log snippet, 10.225.61.37 is a gorouter IP and the subscription id is external-v2-nozzle-sub-id.
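
To find which traffic controller instances a given external v2 subscription is hitting, you can grep the rlpg access logs in the same way as the v1 examples above. This is a sketch using the example subscription id; the log directory matches the path shown in the next section:

# Search all traffic controller instances for rlpg requests carrying a given subscription id
$ bosh ssh loggregator_trafficcontroller 'sudo grep "shard_id=external-v2-nozzle-sub-id" /var/vcap/sys/log/reverse_log_proxy_gateway/reverse_log_proxy_gateway.stdout.log' | grep stdout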

If the nozzle is an internal v2 consumer, it will connect directly to the rlp. Currently, subscription id mappings are not available in the reverse_log_proxy job logs, so the method for tracking these v2 nozzles involves identifying the IPs connected to the rlp job. For example, let's say we are investigating v2 log loss and we need to track down that consumer:

1 - Find the index of the rlp that is dropping the most by tracking the dropped metric in your metric ingestion system:

[Image: graph of the rlp dropped metric per loggregator_trafficcontroller instance]

In this example picture, we can see that the rlp on loggregator_trafficcontroller/b588ed10-179a-4bb4-874d-f6c7b031ddc3 is experiencing the most loss.
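
You can then land on that specific instance for the next step. A minimal sketch using the instance id from the example above:

# SSH to the traffic controller instance whose rlp is dropping the most
$ bosh ssh loggregator_trafficcontroller/b588ed10-179a-4bb4-874d-f6c7b031ddc3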

2 - SSH onto the instance and find the connected IP addresses:

loggregator_trafficcontroller/b588ed10-179a-4bb4-874d-f6c7b031ddc3:/var/vcap/sys/log/reverse_log_proxy_gateway# netstat -ant | grep $(ifconfig eth0 | grep "inet addr" | cut -d ':' -f 2 | cut -d ' ' -f 1):8082
tcp        0      0 10.225.61.15:8082       10.84.231.49:55309      ESTABLISHED

This will give us a list of connected internal v2 nozzles. In this example, there is only one connected v2 nozzle, at IP 10.84.231.49. We would then identify which deployment that IP belongs to (see the sketch below). Please note that the output of this command may return many connected v2 consumers; in that case, each connected consumer must be considered and followed up on individually.
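
To map a connected IP back to a nozzle, one option (assuming the nozzle is deployed by the same BOSH director) is to search the director's VM listing:

# Find which deployment/instance owns the connected nozzle IP
$ bosh vms | grep 10.84.231.49

If the nozzle is not BOSH-deployed, you will need to trace the IP through your IaaS or network inventory instead.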