Log message output too high. We've dropped 100 messages.
This article will provide some background about this message and give some tips for troubleshooting it.
Loggregator, the component in Cloud Foundry which handles application logs, will emit the "Log message output too high. We've dropped 100 messages" message when an endpoint (websocket or syslog drain) can't keep up.
Loggregator keeps a buffer of 100 messages. Once that limit is reached, the 101st message will cause the buffer to be truncated. The warning message will be added to the buffer and then the 101st message will be added, resulting in a new buffer size of 2 messages. This message is only sent when there are issues with external syslog or websocket endpoints. It is not emitted when there are issues inside or between CF components.
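The truncation behavior described above can be modeled in a few lines. This is only an illustrative sketch, not Loggregator's actual code: the class name TruncatingBuffer and its add method are hypothetical, and the limit of 100 is taken from the number in the warning text.

```python
BUFFER_LIMIT = 100  # assumed limit, taken from the warning text


class TruncatingBuffer:
    """Hypothetical model of the drop behavior; not Loggregator's real implementation."""

    def __init__(self, limit=BUFFER_LIMIT):
        self.limit = limit
        self.messages = []

    def add(self, message):
        # When the buffer is already full, discard everything buffered so far
        # and replace it with a single warning that records how many were lost.
        if len(self.messages) >= self.limit:
            dropped = len(self.messages)
            self.messages = [
                f"Log message output too high. We've dropped {dropped} messages"
            ]
        # The incoming message is always kept.
        self.messages.append(message)


buf = TruncatingBuffer()
for i in range(1, 102):  # send 101 messages to an endpoint that never drains
    buf.add(f"message {i}")

print(len(buf.messages))  # → 2
print(buf.messages[0])    # → Log message output too high. We've dropped 100 messages
print(buf.messages[1])    # → message 101
```

In this model, the 101st message leaves the buffer holding only the warning and the newest message, which matches the "new buffer size of 2 messages" described above.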
The buffer could be overflowing for a variety of reasons. The connection to the server may have timed out or be suffering from other connection issues. Writes to the endpoint may be slow and blocking. In the case of websockets, the remote side may have stopped responding to "pings".
Basically, Loggregator assumes that CF operators don't have infinite resources, that it is running in a cloud environment that can be somewhat hostile (machines going up and down, networks that experience packet loss, etc.), and that it is operating in a multi-tenant environment where it needs to attempt to provide fair shares of available resources to all applications. Loggregator attempts to make trade-offs to protect itself and other tenants when faced with large log volumes or slow/unavailable downstream log consumers. It also attempts to never put back pressure on the application or CF component when faced with large log volumes or resource constraints. It assumes you'd rather have your application serve your clients while maybe losing a log message here and there, as opposed to having your application and users "waiting" for a log pipeline to clear due to a slow external log consumer.
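The "no back pressure" trade-off can be illustrated with a bounded queue whose producer never blocks. This is a minimal sketch under stated assumptions: NonBlockingEmitter and emit are hypothetical names, and for brevity it drops the newest message when full rather than truncating the buffer as Loggregator is described as doing. The point it shows is only that the caller returns immediately whether or not the message was kept.

```python
import queue


class NonBlockingEmitter:
    """Hypothetical emitter that prefers dropping a log line to blocking the caller."""

    def __init__(self, maxsize=100):
        self.pending = queue.Queue(maxsize=maxsize)  # bounded, so memory use is capped
        self.dropped = 0

    def emit(self, message):
        try:
            # put_nowait raises queue.Full instead of waiting for free space,
            # so a slow consumer can never stall the application.
            self.pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False


# No consumer is draining the queue here, simulating a stalled log endpoint.
emitter = NonBlockingEmitter(maxsize=3)
results = [emitter.emit(f"line {i}") for i in range(5)]

print(results)          # → [True, True, True, False, False]
print(emitter.dropped)  # → 2
```

A blocking put would instead make the application wait on the slow consumer, which is exactly the situation the design described above tries to avoid.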
The Loggregator engineering team is of the opinion that in a large distributed environment that spans internal/external networks and third-party systems, on shared/limited hardware, you can never build a system that can deliver 100% of the messages 100% of the time while still ensuring adequate response time from the logging application. This is very much in line with the CAP theorem.
The long and short of it is that if you need to ensure the data in your log messages is "never" lost, logs are not the best place for that data. Rather, you should put this information in a more formal datastore such as a database. Such systems make different decisions/trade-offs and use different protocols. In theory this results in a much better chance that your data will be safely stored.
Because there is no single definitive cause of this problem, there is also no single definitive solution. Here are some things to check if you're seeing this message appear.