Troubleshooting dropped or missing log messages

Article ID: 294875

Products

Services Suite

Issue/Introduction

Symptoms:
When using an external logging service, you may see a message like:
Log message output too high. We've dropped 100 messages.
This article will provide some background about this message and give some tips for troubleshooting it.

Environment


Cause

When is this message used?

Loggregator, the component in Cloud Foundry which handles application logs, will emit the "Log message output too high. We've dropped 100 messages" warning when an endpoint (websocket or syslog drain) can't keep up.

Loggregator keeps a buffer of 100 messages. Once that limit is reached, the 101st message causes the buffer to be truncated: the warning message is added to the now-empty buffer, then the 101st message is added, resulting in a new buffer size of 2 messages. This warning is only sent when there are issues with external syslog or websocket endpoints; it is not emitted when there are issues inside or between CF components.
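To make that concrete, here is a minimal Python sketch of a buffer that behaves as described above. It is an illustration only, not Loggregator's actual implementation (Loggregator is written in Go), and all names in it are hypothetical.

    from collections import deque

    # Illustration only -- models the behavior described above: when the
    # 100-message buffer is full, it is truncated, the warning is enqueued,
    # and then the new message is appended, leaving a buffer of 2.

    BUFFER_LIMIT = 100

    class TruncatingLogBuffer:
        def __init__(self, limit=BUFFER_LIMIT):
            self.limit = limit
            self.messages = deque()

        def append(self, message):
            if len(self.messages) >= self.limit:
                dropped = len(self.messages)
                self.messages.clear()  # drop the entire full buffer
                self.messages.append(
                    f"Log message output too high. We've dropped {dropped} messages"
                )
            self.messages.append(message)

    buf = TruncatingLogBuffer()
    for i in range(101):          # the 101st append triggers truncation
        buf.append(f"message {i}")
    print(len(buf.messages))      # 2: the warning plus the 101st message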

What would cause the buffer to overflow?

The buffer can overflow for a variety of reasons: the connection to the remote server may have timed out (or hit other connection issues), writes may be slow and blocking, or, in the case of websockets, the remote side may have stopped responding to "pings".

Why is it designed to work like this?

Loggregator assumes that CF operators don't have infinite resources, that it is running in a cloud environment that can be somewhat hostile (machines going up and down, networks that do experience packet loss, etc.), and that it is operating in a multi-tenant environment where it needs to provide fair shares of the available resources to all applications. Loggregator makes trade-offs to protect itself and other tenants when faced with large log volumes or slow/unavailable downstream log consumers. It also attempts never to put back pressure on the application or CF component when faced with large log volumes or resource constraints. It assumes you'd rather have your application serve your clients while maybe losing a log message here and there, as opposed to having your application and users waiting for a log pipeline to clear because of a slow external log consumer.

The Loggregator engineering team is of the opinion that in a large distributed environment that spans internal/external networks and third-party systems, on shared/limited hardware, you can never build a system that can deliver 100% of the messages 100% of the time while still ensuring adequate response time from the logging application. This is very much in line with the CAP theorem.

The long and short of it is that if you need to ensure the data in your log messages is never lost, logs are not the best place for that data. Instead, put this information in a more formal datastore, such as a database. Such systems make different decisions/trade-offs and use different protocols, which in theory gives your data a much better chance of being safely stored.
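As an illustration of that trade-off, the sketch below records a must-not-lose event with an acknowledged database write instead of a log line. The table name and event are hypothetical, and SQLite stands in here for whatever datastore you actually use.

    import sqlite3

    # Hedged sketch: persist critical events with an acknowledged write.
    conn = sqlite3.connect("audit.db")
    conn.execute("CREATE TABLE IF NOT EXISTS audit_events (ts TEXT, event TEXT)")

    def record_event(event):
        # The transaction commits before returning -- the opposite trade-off
        # from Loggregator, which sheds load rather than block the app.
        with conn:
            conn.execute(
                "INSERT INTO audit_events (ts, event) VALUES (datetime('now'), ?)",
                (event,),
            )

    record_event("order 12345 charged")  # hypothetical business event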

Resolution

Because there is no one definitive cause of this problem, there is also no one definitive solution. Here are some things to check if you're seeing this message appear.

  • Look at the log level of your application. If it's using a log level of DEBUG, VERBOSE, TRACE, or some other fine-grained logging level, consider changing the log level to reduce the output. Alternatively, you could be more granular about what you log: many logging systems let you filter messages so that only messages from specific components are emitted (a minimal sketch follows this list). This, or any other strategy to reduce the volume of logging, would be helpful.
  • Ensure that the service you're using is hosted in an environment with adequate bandwidth and on a network with a low-latency connection to PWS. The easiest way to do this is to look for third-party services that are hosted on AWS, or to host your own service on AWS, since PWS runs on AWS.
  • Make sure you can see log messages with cf logs <app>. Streaming logs this way also surfaces additional messages about connection issues with syslog endpoints, so try it and see whether it provides any more debugging information.
  • Make sure that your log lines are actually separate lines, with a newline character of some type between them.
  • A common issue with some runtimes or application frameworks is that stdout is buffered. This can result in logs either never making it past the buffer (very little logging, or short-lived apps that don't flush the buffer on shutdown/crash) or arriving with a delay. See the flushing sketch after this list.
  • Check the requirements of your logging service. Some third-party logging services require that you whitelist the group of IP addresses that will be sending logs to the service. This list changes from time to time and is located here.
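For the log-level suggestion above, here is a minimal Python sketch of reducing log volume by raising the threshold and filtering one noisy component. "noisy.component" is a hypothetical logger name, and your framework's configuration will differ.

    import logging
    import sys

    # Raise the threshold from DEBUG to INFO so fine-grained messages never
    # reach stdout at all, and silence one hypothetical chatty component.
    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    logging.getLogger("noisy.component").setLevel(logging.WARNING)

    logging.debug("suppressed before it ever reaches Loggregator")
    logging.info("this message is still emitted")

And for the newline and stdout-buffering items, a short sketch that emits one record per line and flushes explicitly, so a short-lived app doesn't exit with logs still sitting in its buffer. (With CPython you can also set PYTHONUNBUFFERED=1 or run with python -u; other runtimes have their own equivalents.)

    import sys

    # One record per line, flushed immediately.
    sys.stdout.write("event=start status=ok\n")  # the newline terminates the record
    sys.stdout.flush()                           # push it past the buffer now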