
Logs drop from firehose to Splunk


Article ID: 297961


Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Log loss is observed in a Splunk Tile installation. The doppler.log file consistently shows the error "Dropped (egress) X000 envelopes".

Environment

Product Version: 2.4

Resolution

The error "Dropped (egress) X000 envelopes" indicates that the configured consumer(s) are not consuming logs at the same rate that Loggregator delivers. This condition can lead to back pressure on the doppler components and eventual dropped envelopes.

Example consumers are third-party nozzles from Splunk or New Relic, ingestor applications used by Healthwatch and/or Metrics, and troubleshooting tools such as cf top and cf logs that stream logs. As a rule of thumb, scale the number of consumer instances to match the number of Doppler instances. If the recommended scaling for the Loggregator components (see https://techdocs.broadcom.com/us/en/vmware-tanzu/platform/tanzu-platform-for-cloud-foundry/6-0/tpcf/log-ops-guide.html) has already been done but the problem persists, the following steps can help determine which consumer is causing envelopes to be dropped on the consumer side.
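For example, you can compare the number of Doppler instances with the number of instances of a nozzle and scale the nozzle to match. This is a minimal sketch only; the deployment name cf and the application name splunk-firehose-nozzle are placeholders for the names used in your environment:

# Count the Doppler VMs in the deployment
bosh -d cf vms | grep -c doppler

# Check how many instances of the nozzle are currently running
cf app splunk-firehose-nozzle

# Scale the nozzle so its instance count matches the Doppler count (4 in this example)
cf scale splunk-firehose-nozzle -i 4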

1. Use BOSH to SSH into one of the Doppler VMs. You might need to cycle through the Dopplers in an environment with multiple Doppler VMs.

2. Become the root user: sudo su -

3. Change to the Doppler log directory: cd /var/vcap/sys/log/doppler

4. Tail the log file: $ tail -f doppler.log

5. If you are experiencing log loss, you will observe the error "Dropped (egress) X000 envelopes".
a. How frequently the error appears is indicative of how severe the problem is.

6. Use the process of elimination to determine which consumer is contributing to the problem:
a. Stop the Metrics ingestor applications.
b. Stop the Healthwatch ingestor applications.
c. Stop any third-party nozzles that connect to the Loggregator firehose.
d. Stop any plugins that connect to the Loggregator firehose.

A consolidated command-line sketch of these steps is shown below, followed by some helpful commands for troubleshooting this problem.
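The sketch below illustrates steps 1 through 6; it is not taken verbatim from the procedure above. The deployment name cf, the Doppler instance doppler/0, and the consumer application names (metrics-ingestor, healthwatch-ingestor, splunk-firehose-nozzle) are placeholders for the names used in your environment:

# SSH to a Doppler VM (repeat for other Doppler instances as needed)
bosh -d cf ssh doppler/0

# On the Doppler VM: become root and watch the log for dropped-envelope errors
sudo su -
cd /var/vcap/sys/log/doppler
tail -f doppler.log | grep -i dropped

# From a workstation with the cf CLI: stop consumers one at a time, watching
# whether the dropped-envelope errors subside after each one (names are examples)
cf stop metrics-ingestor
cf stop healthwatch-ingestor
cf stop splunk-firehose-nozzle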


1. Determine the number of subscriptions to the firehose. The value reported should be consistent with the total number of configured consumers. In the commands below, <system domain> is a placeholder for your environment's system domain.

curl -G -H "Authorization: $(cf oauth-token)" "https://log-cache.<system domain>/api/v1/query" --data-urlencode 'query=subscriptions{source_id="doppler"}'

2. Run the following queries (as an admin) to get ingress and egress statistics.

Average ingress per Doppler per second:
curl -G -H "Authorization: $(cf oauth-token)" "https://log-cache.<system domain>/api/v1/query" --data-urlencode 'query=avg(rate(ingress{source_id="doppler"}[5m]))'

Total ingress across all Dopplers per second:
curl -G -H "Authorization: $(cf oauth-token)" "https://log-cache.<system domain>/api/v1/query" --data-urlencode 'query=sum(rate(ingress{source_id="doppler"}[5m]))'

Average dropped per Doppler per second:
curl -G -H "Authorization: $(cf oauth-token)" "https://log-cache.<system domain>/api/v1/query" --data-urlencode 'query=avg(rate(dropped{source_id="doppler"}[5m]))'

Total dropped across all Dopplers per second:
curl -G -H "Authorization: $(cf oauth-token)" "https://log-cache.<system domain>/api/v1/query" --data-urlencode 'query=sum(rate(dropped{source_id="doppler"}[5m]))'


3. The expected output of each query looks like:

a. {"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1574128855.000,"0"]}]}}

b. In the value [1574128855.000,"0"], the first element is the epoch timestamp of the query and "0" is the result.

c. These queries help confirm whether Loggregator is functional: if Doppler reports little or no dropping while consumers are still missing logs, the loss is occurring downstream at the consumers. A sketch that combines the dropped and ingress queries into a loss percentage follows.
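The following is a minimal sketch, not part of the original procedure, for expressing the drop rate as a percentage of ingress. It assumes the jq and bc utilities are installed and that <system domain> is replaced with your environment's system domain:

# Helper that runs a PromQL query against Log Cache and extracts the numeric result
query() {
  curl -s -G -H "Authorization: $(cf oauth-token)" \
    "https://log-cache.<system domain>/api/v1/query" \
    --data-urlencode "query=$1" | jq -r '.data.result[0].value[1]'
}

dropped=$(query 'sum(rate(dropped{source_id="doppler"}[5m]))')
ingress=$(query 'sum(rate(ingress{source_id="doppler"}[5m]))')

# Percentage of ingested envelopes that Doppler is dropping
echo "scale=4; 100 * $dropped / $ingress" | bc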