Checklist:
The scaling concerns described below are kept together in this knowledge article because they can interrelate when supporting customers experiencing partial or complete loss of logs and metrics.
Doppler & Traffic Controller Scaling Concerns
Ceiling for the number of Doppler & Traffic Controller Instances
As a reference, we generally recommend the following Doppler scale guidance.
Issue: Mature organizations running a large number of application instances (AIs), typically 5,000+, start to hit a scale ceiling on Loggregator at approximately 40 Dopplers and 20 Traffic Controllers. This is an M x N (Dopplers x Traffic Controllers) problem: each Traffic Controller connects to every Doppler, so the connection cost grows with the product of the two counts.
Next Steps: At this instance limit, the best course of action is to scale vertically (increase CPU resources) on the existing Doppler and Traffic Controller instances to add more headroom.
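As a rough way to see how close a foundation is to this ceiling, the current Doppler and Traffic Controller counts can be listed with the BOSH CLI. This is a minimal sketch; the deployment name (cf-<guid>) and the instance group names shown (doppler, loggregator_trafficcontroller) are the typical PAS defaults and may differ on the customer's foundation:

    # Find the PAS deployment name
    bosh deployments
    # List the Doppler and Traffic Controller instances for that deployment
    bosh -d cf-<guid> instances | grep doppler
    bosh -d cf-<guid> instances | grep loggregator_trafficcontroller

Note that in PCF the vertical scaling itself (changing the VM type) is applied through the Resource Config page of the PAS tile in Ops Manager, followed by Apply Changes.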
Doppler(s) showing unusually high CPU
Issue: Some customers reporting Doppler scale issues have also reported unusually high CPU on the Doppler instances, which appears to also degrade Loggregator performance. These issues manifest in the following ways (a quick BOSH check is sketched after this list):
- Doppler VMs are not balanced, with some running at 60% and some at 100%
- BOSH restarting Doppler VMs with high frequency
- App Developers complaining to Operators that they cannot see their app logs
- App Developers complaining to Operators that they can only see partial app logs
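A quick way to confirm the first two symptoms is to check VM vitals and recent events via the BOSH CLI, for example (again assuming the default PAS deployment and instance group names):

    # Show CPU, memory, and load for every VM in the deployment;
    # look for doppler instances with widely varying or pegged user CPU
    bosh -d cf-<guid> vms --vitals | grep doppler
    # Frequent Doppler recreates by BOSH will show up in the event log
    bosh -d cf-<guid> events | grep doppler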
Next Steps: Ensure the customer is on a PCF version that contains the fix: 2.0.19, 2.1.10, or 2.2.3 (or later on the corresponding release line).
Release Notes: [Feature Improvement] Loggregator agent egresses preferred tags instead of DeprecatedTags in loggregator envelopes. This fixes a high CPU issue in Doppler cluster.
CF Syslog Drain Scaling Concerns
Ceiling for the number of CF Syslog Drain Bindings
For shipping app logs to an external service like ELK or Splunk, Loggregator has recommended that customers use the colocated cf-syslog-drain-release and configure user-provided services (CUPS), using tooling such as the cf space-drain, to bind apps to the customer's syslog endpoints.
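For reference, a single drain binding created via CUPS looks roughly like the following; the service name, app name, and syslog endpoint are placeholders, and the space-drain tooling simply automates this kind of binding for every app in a space:

    # Create a user-provided service instance that points at the external syslog endpoint
    cf create-user-provided-service my-logs-drain -l syslog-tls://logs.example.com:6514
    # Bind an app to the drain; each such binding counts toward the ceiling discussed below
    cf bind-service my-app my-logs-drain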
Issue: However, at approximately 10,000 drain bindings, there appears to be a ceiling at which SLO performance starts to decline. At this approximate “max number of drains”, additional scaling of Loggregator components will not improve the issue.
Note: If the customer is leveraging PCF Healthwatch v1.4+, their current number of active drains can be seen on the Logging Performance details page.
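If Healthwatch is not available, a rough count of drain service instances can be pulled from the Cloud Controller API. This is only an approximation (it does not count per-app bindings and is subject to pagination), and the jq filter is an assumption about the v2 response shape:

    # Count user-provided service instances that carry a syslog drain URL (first page only)
    cf curl '/v2/user_provided_service_instances?results-per-page=100' \
      | jq '[.resources[] | select(.entity.syslog_drain_url != null and .entity.syslog_drain_url != "")] | length'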
Unfortunately, the options available to resolve this scale issue are not desirable recommendations:
- (in the future) leverage Isolated Loggregator via Isolation Segments
- Why not desirable - This is not currently possible in PAS. However, it is expected to become an available option in PAS v2.5
- Spin up more PCFs and move some of the AI traffic to the new foundations
- Why not desirable - The customer would incur additional cost for the overhead of an entirely new PCF foundation
Next Steps:
- Consider pushing traffic to another, less AI-heavy PCF foundation
- Leverage the (community-maintained) Firehose-to-Syslog nozzle for sending logs to an external provider such as ELK
- If sending logs to an external Splunk, then leverage the partner-supported Splunk tile