Log loss occurring in TAS due to back pressure from unused app nozzles

Article ID: 384869

Products

VMware Tanzu Application Service

Issue/Introduction

This KB article illustrates a scenario in which log loss (i.e., dropped envelopes) occurs due to outdated, unused application nozzles. Using outdated CF CLI plugins like the FirehosePlugin, or leaving idle or unused nozzles running, can increase back pressure on the Loggregator VMs and eventually cause log loss / dropped envelopes.

This KB article also assumes that you have already adequately scaled the Loggregator, Log Cache, and Doppler VMs per the documentation and are still experiencing log loss / envelope drops. The ideal ratio of Dopplers to Loggregators to nozzles is 2:1:1. For example, 40 Dopplers, 20 Loggregators, and 20 Splunk / firehose nozzles match the ideal ratio: 40:20:20 = 2:1:1.
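
To confirm the current Doppler and Loggregator instance counts, you can list the relevant instance groups from the BOSH director. This is a minimal sketch; it assumes your TAS deployment is named cf, so adjust the deployment name to your environment:

bosh -d cf instances | grep -E 'doppler|loggregator'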

Symptoms of this issue include the following:

  • 1. Dropped envelopes in Doppler.stderr.log (on the Doppler VM), where the ShardID indicates the nozzle the drops originate from. Based on the log output below, the drops are occurring in the New Relic nozzle and the FirehosePlugin:

2024-11-26T15:18:43.511264556Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:44.067754158Z Dropped 1000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:45.312025078Z Dropped 3000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:46.511093603Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:47.280569372Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:48.697958693Z Dropped 4000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:48.934935543Z Dropped 1000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:49.799814328Z Dropped 1000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:50.671530452Z Dropped 3000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:51.259809542Z Dropped 1000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:52.120642192Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:53.053222981Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:54.686482713Z Dropped 4000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:55.818981168Z Dropped 3000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:56.392525947Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:57.017061080Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-12-03T16:21:45.694002017Z Dropped 2000 envelopes (v2 buffer) ShardID: newrelic.firehose
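
To tally which ShardIDs the drops are attributed to, you can count the occurrences in Doppler.stderr.log. A sketch, assuming a TAS deployment named cf and the standard BOSH log path (both may differ in your environment):

bosh -d cf ssh doppler/0 -c "grep 'Dropped' /var/vcap/sys/log/doppler/doppler.stderr.log" | grep -o 'ShardID: .*' | sort | uniq -c | sort -n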

  • 2. Assuming Healthwatch is installed, checking the Healthwatch dashboard (Logging & Metrics Pipeline > RLP Message Loss Rate) shows a high RLP message loss rate.

Resolution

There are a few options we can consider to mitigate this issue: 

  • 1. Implement Aggregate Syslog Forwarding in TAS. This removes the need for nozzles like the Splunk nozzle or firehose-to-syslog nozzle altogether. More information about configuring aggregate drains can be found in the TAS documentation under the section "Aggregate Syslog Forwarding".
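
For reference, aggregate drains are configured in the TAS tile (System Logging pane) by supplying a destination address, port, and transport. The values below are purely hypothetical, and the exact field names can vary by TAS version:

Address: logs.example.com
Port: 6514
Transport protocol: TCP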

  • 2. Remove unused nozzles, or nozzles whose ShardID appears in the envelope drops. In our example, the New Relic nozzle is contributing to dropped envelopes / log loss, as seen in this Doppler.stderr.log entry:

2024-12-03T16:21:45.694002017Z Dropped 2000 envelopes (v2 buffer) ShardID: newrelic.firehose

In this case, we can log into Apps Manager and search for the keywords "nozzle" or "firehose".

From there, we can stop the unnecessary New Relic nozzle or delete it altogether.
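
The same can be done from the CF CLI. A sketch, assuming the nozzle is deployed as an app named newrelic-firehose-nozzle in the system org; the app name, org, and space will differ per environment:

cf target -o system -s <NOZZLE_SPACE>
cf apps | grep -iE 'nozzle|firehose'
cf stop newrelic-firehose-nozzle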

  • 3. Uninstall unused, outdated CF CLI plugins that are contributing to envelope drops. In the Doppler.stderr.log excerpt below, envelopes are being dropped for the FirehosePlugin ShardID:

2024-11-26T15:18:43.511264556Z Dropped 2000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:44.067754158Z Dropped 1000 envelopes (v1 buffer) ShardID: FirehosePlugin
2024-11-26T15:18:45.312025078Z Dropped 3000 envelopes (v1 buffer) ShardID: FirehosePlugin

Given that the FirehosePlugin is unused, we can uninstall this CF CLI plugin, which is contributing to the log loss / dropped envelopes, via the command below:

cf uninstall-plugin FirehosePlugin
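
Before uninstalling, you can list the installed CF CLI plugins to confirm what is present and spot any other unused plugins:

cf plugins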

  • 4. Consider scaling down the number of nozzles. Having too many nozzle instances can itself increase pressure on the Loggregator VMs and lead to log loss. For example, if we have 40 Dopplers, 20 Loggregators, and 54 instances of the Splunk nozzle, we can scale the Splunk nozzle down from 54 to 20 instances to match the number of Loggregator VM instances (20 in this example).

As an example, we scale the number of Splunk nozzle instances down from 54 to 20 and then run Apply Changes on the Splunk Firehose Nozzle tile only.

If you are using a nozzle deployed as an application, such as the New Relic or firehose-to-syslog nozzle, you can scale down the number of nozzle instances via the CF CLI command below:

cf scale <NOZZLE_APP_NAME> -i <NUMBER_OF_INSTANCES>
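
For example, scaling a hypothetical nozzle app named firehose-to-syslog from 54 down to 20 instances:

cf scale firehose-to-syslog -i 20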

  • 5. Check if there are any noisy applications and consider reducing their logging output via the steps below:
    • 1. Install the log-cache CLI plugin via the command below:  
cf install-plugin -r CF-Community "log-cache"
    • 2. Run the log-meta command to identify the logging output of applications running on the foundation: 
cf log-meta --guid --noise --sort-by rate | tee noisy_apps.txt
    • 3. Analyze the contents of the noisy_apps.txt file. Note that because the output is sorted by rate, the applications with the largest logging output (i.e., the "noisy" apps) are at the bottom of the file:

8786f93e-c12d-4c21-82bb-5e3b57297e41  100000  2094164230   1m17s       106514
d378d7ff-1359-4fc4-82a7-604f55080fbe  100000  1641340419   53s         120295
c0e7e615-c82a-41ae-bdd5-3370444abfd6  100000  2241700486   45s         127611
984c4073-5c62-47e9-9d06-71c124a06203  100000  2682335689   1m5s        129968
system_metrics_agent                  100000  2698345127   32s         143044
6b4fcead-8631-4e3b-887a-c8aababe56db  100000  4012195180   28s         251197
gorouter                              100000  10336679854  14s         543042

As seen in the output snippet of noisy_apps.txt above, gorouter and system_metrics_agent are platform components rather than applications, so the noisiest app is the one with GUID 6b4fcead-8631-4e3b-887a-c8aababe56db, which has a log output of 251,197 logs per minute, or roughly 4,186 logs per second (251,197 / 60).
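
As a quick sanity check, assuming the last column of noisy_apps.txt is the per-minute log rate (as in the snippet above), you can convert each entry to logs per second:

awk '{printf "%s  %.0f logs/sec\n", $1, $NF/60}' noisy_apps.txt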

    • 4. Next, we can run the command below to find the application name that corresponds to an application GUID. Replace <APP GUID> with the GUID of the app whose logging output we want to reduce:
cf curl /v3/apps | jq . | grep -B 50 -A 50 '"guid": "<APP GUID>"' | grep '"name":' --color=auto
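
Note that /v3/apps is paginated, so the grep approach above may miss apps beyond the first page of results. If you already have the GUID, a simpler equivalent is to query the app resource directly:

cf curl /v3/apps/<APP GUID> | jq -r .name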

    • 5. After finding the name of the application in question, we can set a log rate limit on that application, which caps how many logs it can generate per second. Note that the -l parameter of the cf scale command requires CF CLI version 8.5 or later; more information can be found in this KB article.

As an example, we will assign a log rate limit of 100 bytes per second (the default noted in the KB article referenced above). Note that setting the log rate limit restarts the application:

cf scale <App Name> -l 100B
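
You can confirm that your CLI meets the version requirement with cf version, and later remove the limit again by setting it to -1 (unlimited):

cf version
cf scale <App Name> -l -1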

After implementing one or more of the options above, we would expect the rate of dropped envelopes to decrease or stop entirely, along with a corresponding decrease in the RLP Message Loss Rate in Healthwatch.
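
To verify, you can re-check the tail of Doppler.stderr.log for fresh "Dropped" entries after making the changes. A sketch, again assuming a TAS deployment named cf and the standard BOSH log path:

bosh -d cf ssh doppler/0 -c "tail -n 50 /var/vcap/sys/log/doppler/doppler.stderr.log | grep 'Dropped'"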