Why Loggregator May Lose Logs

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

This article discusses why the Loggregator may lose messages in Pivotal Cloud Foundry.

Environment

Product Version: 1.9

Resolution

Loggregator simply transports log and metric messages in Pivotal Cloud Foundry Elastic Runtime. It makes this information available to users and to external log management systems. Persisting the logs is the responsibility of whatever consumes them. Examples include log aggregators such as an ELK stack or Splunk, or simply the cf logs command.

Log messages that are not immediately extracted and persisted are discarded. The exception is the small number of logs held in a buffer and available through cf logs --recent. More details on the components of Loggregator can be found in the Loggregator documentation.

Loggregator Design

Loggregator transports logs using the UDP protocol. The reason for this protocol choice is that Loggregator should be nonblocking to applications. With "fire and forget" UDP, Diego logging mechanisms never block on transmission.
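To illustrate what "fire and forget" means in practice, here is a minimal Python sketch of a sender that hands a log line to the operating system as one UDP datagram and returns without waiting for an acknowledgement. It is not Loggregator's code; the function names send_log_line and make_sender are hypothetical and exist only for this example.

```python
import socket


def make_sender():
    """Create a UDP socket that never blocks the caller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)
    return sock


def send_log_line(sock, message, address):
    """Send one log line as a single UDP datagram.

    UDP has no acknowledgement, so the return value (bytes handed to the
    kernel) says nothing about whether the receiver actually read the message.
    """
    return sock.sendto(message.encode("utf-8"), address)
```

The sender gets the same result whether or not anything reads the datagram on the other side, which is why this style of transport cannot block an application and also cannot detect loss by itself.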

Another design goal was to be performant at scale. The Loggregator supports horizontal scaling by replicating Dopplers, Traffic Controllers, and Nozzles. This "fabric" of components should deliver messages as fast as it is capable of, even if it is not scaled up to a large enough configuration to handle the entire load. Again, UDP was chosen as the protocol from the Metrons to the Dopplers to keep logs flowing to the highest degree. Dopplers simply drop the UDP packets that they are not capable of consuming.

Log Loss

A consequence of these design decisions is that delivery is not guaranteed. Both UDP links (application to Metron, and Metron to Doppler) can, and do, lose log messages. A UDP message can be lost in two ways: the network drops the packet, or the receiving component cannot keep up when reading incoming UDP messages. The second mechanism is the dominant cause of loss in Loggregator, and it occurs in two scenarios:

1) Metron --> Doppler

2) Application --> Metron
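The "receiver cannot keep up" mechanism can be pictured with a toy model: messages arrive into a fixed-size buffer, the receiver reads a limited number per tick, and anything that arrives while the buffer is full is dropped. The sketch below is an illustration under those assumptions only; the function simulate_receiver and its parameters are hypothetical and do not describe the real Metron or Doppler internals.

```python
from collections import deque


def simulate_receiver(arrivals_per_tick, reads_per_tick, buffer_size, ticks):
    """Return (received, dropped) for a receiver with a bounded buffer."""
    buffer = deque()
    received = 0
    dropped = 0
    for _ in range(ticks):
        # Arrivals never wait: a message that finds the buffer full is lost.
        for _ in range(arrivals_per_tick):
            if len(buffer) < buffer_size:
                buffer.append(1)
            else:
                dropped += 1
        # The receiver drains at most reads_per_tick messages per tick.
        for _ in range(min(reads_per_tick, len(buffer))):
            buffer.popleft()
            received += 1
    return received, dropped
```

In this model, a receiver that reads faster than messages arrive drops nothing, while a receiver that reads slower drops the excess once its buffer fills, regardless of how reliable the network is.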

Predicted Message Loss Per-Doppler at Various Loads

Msgs/Sec	Loss
500	0.9%
1000	1.7%
1500	2.6%
2000	3.5%
2500	4.3%
3000	5.2%
3500	6.1%
4000	7.0%
4500	8.0%
5000	8.9%
5500	9.8%
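For loads that fall between the rows of the table, a rough estimate can be obtained by linear interpolation between the two nearest rows. The sketch below assumes that interpolation is reasonable because the tabulated values grow almost linearly; that is an assumption of this example, not a statement from the white paper, and the helper predicted_loss_percent is hypothetical.

```python
from bisect import bisect_left

# (messages per second per Doppler, predicted loss in percent), from the table above.
PREDICTED_LOSS = [
    (500, 0.9), (1000, 1.7), (1500, 2.6), (2000, 3.5), (2500, 4.3), (3000, 5.2),
    (3500, 6.1), (4000, 7.0), (4500, 8.0), (5000, 8.9), (5500, 9.8),
]


def predicted_loss_percent(rate):
    """Interpolate the predicted loss for a per-Doppler message rate."""
    rates = [r for r, _ in PREDICTED_LOSS]
    if rate < rates[0] or rate > rates[-1]:
        # The table gives no data outside this range, so do not extrapolate.
        raise ValueError("rate is outside the tabulated range")
    i = bisect_left(rates, rate)
    if rates[i] == rate:
        return PREDICTED_LOSS[i][1]
    (r0, l0), (r1, l1) = PREDICTED_LOSS[i - 1], PREDICTED_LOSS[i]
    return l0 + (l1 - l0) * (rate - r0) / (r1 - r0)
```

For example, a Doppler handling 750 messages per second falls halfway between the 500 and 1000 rows, so the estimate is halfway between 0.9% and 1.7%.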

How to monitor log loss:

Scenario 1: For loss between Metrons and Dopplers, compare the following metrics:

Messages sent by Metron --> MetronAgent.DopplerForwarder.sentMessages

Messages received by Doppler --> DopplerServer.dropsondeListener.receivedMessageCount

Scenario 2: For loss within an individual Diego VM, compare the following metrics:

Messages sent by Diego Executor --> rep.logSenderTotalMessagesRead

Messages processed by Metron --> MetronAgent.DopplerForwarder.sentMessages
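In both scenarios the comparison comes down to the same arithmetic: take the upstream and downstream counters over the same time window and express the difference as a percentage of what was sent. The sketch below assumes the metrics are cumulative counters sampled at the start and end of the window, and that values have already been summed across all relevant Metrons or Dopplers; how the samples are collected is outside this example, and the function names are hypothetical.

```python
def counter_delta(before, after):
    """Messages counted during the window, given two samples of a cumulative counter."""
    return after - before


def loss_percent(sent, received):
    """Percentage of sent messages that the downstream component did not count."""
    if sent <= 0:
        return 0.0
    # Clamp at zero: sampling skew can make the downstream counter briefly lead.
    lost = max(sent - received, 0)
    return 100.0 * lost / sent


# Scenario 1 example with made-up sample values:
sent = counter_delta(10_000, 12_000)      # MetronAgent.DopplerForwarder.sentMessages
received = counter_delta(9_500, 11_430)   # DopplerServer.dropsondeListener.receivedMessageCount
scenario_1_loss = loss_percent(sent, received)
```

For Scenario 2, the same functions apply with rep.logSenderTotalMessagesRead as the upstream counter and MetronAgent.DopplerForwarder.sentMessages as the downstream counter.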

Future Improvements to Loggregator

In PCF 1.10, Loggregator moved from UDP links to gRPC. This solves Scenario 1, but not Scenario 2. However, Scenario 2 loss is no longer invisible: the change allows explicit notification when messages are dropped because of a saturated Metron, and that notification includes data on how many messages from each app were discarded.

An appropriate place to keep track of Loggregator changes is its GitHub repo.

Additional Information

For more information on this, please read the logreliabilityincloudfoundryloggregatorjuly2016.pdf white paper.