Checklist:
The scaling concerns described below are kept together in this knowledge article because they can interrelate when supporting customers experiencing partial or complete loss of logs and metrics.
Doppler & Traffic Controller Scaling Concerns
Ceiling for the number of Doppler & Traffic Controller Instances
As a reference, we generally recommend the following Doppler scale guidance.
Issue: Mature organizations running a large number of application instances (AIs), typically 5,000+, start to hit a scale ceiling on Loggregator at approximately 40 Dopplers and 20 Traffic Controllers. This is an M x N (Dopplers x Traffic Controllers) problem: each Traffic Controller connects to every Doppler, so the connection cost grows with the product of the two counts.
Next Steps: At this instance limit, the best course of action is to scale vertically (increase CPU resources) on the existing Doppler and Traffic Controller instances to add more headroom.
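As a rough way to see how close a foundation is to this ceiling, the current Doppler and Traffic Controller counts can be listed with the BOSH CLI. This is a minimal sketch; the deployment name (cf-<guid>) and the instance group names shown (doppler, loggregator_trafficcontroller) are the typical PAS defaults and may differ on the customer's foundation:

    # Find the PAS deployment name
    bosh deployments
    # List the Doppler and Traffic Controller instances for that deployment
    bosh -d cf-<guid> instances | grep doppler
    bosh -d cf-<guid> instances | grep loggregator_trafficcontroller

Note that in PCF the vertical scaling itself (changing the VM type) is applied through the Resource Config page of the PAS tile in Ops Manager, followed by Apply Changes.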
Doppler(s) showing unusually high CPU
Issue: Some customers reporting Doppler scale issues have also reported unusually high CPU on the Doppler instances, which appears to also degrade Loggregator performance. These issues manifest in the following ways (a quick BOSH check is sketched after this list):
- Doppler VMs are not balanced, with some running at 60% and some at 100%
- BOSH restarting Doppler VMs with high frequency
- App Developers complaining to Operators that they cannot see their app logs
- App Developers complaining to Operators that they can only see partial app logs
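A quick way to confirm the first two symptoms is to check VM vitals and recent events via the BOSH CLI, for example (again assuming the default PAS deployment and instance group names):

    # Show CPU, memory, and load for every VM in the deployment;
    # look for doppler instances with widely varying or pegged user CPU
    bosh -d cf-<guid> vms --vitals | grep doppler
    # Frequent Doppler recreates by BOSH will show up in the event log
    bosh -d cf-<guid> events | grep doppler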
Next Steps: Ensure the customer is on a PCF version that contains the fix: 2.0.19, 2.1.10, or 2.2.3 (or later on the corresponding release line).
Release Notes: [Feature Improvement] Loggregator agent egresses preferred tags instead of DeprecatedTags in loggregator envelopes. This fixes a high CPU issue in Doppler cluster.
CF Syslog Drain Scaling Concerns
Ceiling for the number of CF Syslog Drain Bindings
For shipping app logs to an external service like ELK or Splunk, Loggregator has recommended that customers use the colocated cf-syslog-drain-release and configure user-provided services (CUPS), using tooling such as the cf space-drain, to bind apps to the customer's syslog endpoints.
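For reference, a single drain binding created via CUPS looks roughly like the following; the service name, app name, and syslog endpoint are placeholders, and the space-drain tooling simply automates this kind of binding for every app in a space:

    # Create a user-provided service instance that points at the external syslog endpoint
    cf create-user-provided-service my-logs-drain -l syslog-tls://logs.example.com:6514
    # Bind an app to the drain; each such binding counts toward the ceiling discussed below
    cf bind-service my-app my-logs-drain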
Issue: However, at approximately 10,000 drain bindings, there appears to be a ceiling at which SLO performance starts to decline. At this approximate “max number of drains”, additional scaling of Loggregator components will not improve the issue.
Note: If the customer is leveraging PCF Healthwatch v1.4+, their current number of active drains can be seen on the Logging Performance details page.
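If Healthwatch is not available, a rough count of drain service instances can be pulled from the Cloud Controller API. This is only an approximation (it does not count per-app bindings and is subject to pagination), and the jq filter is an assumption about the v2 response shape:

    # Count user-provided service instances that carry a syslog drain URL (first page only)
    cf curl '/v2/user_provided_service_instances?results-per-page=100' \
      | jq '[.resources[] | select(.entity.syslog_drain_url != null and .entity.syslog_drain_url != "")] | length'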
Unfortunately, the options available to resolve this scale issue are not desirable recommendations:
- (in the future) leverage Isolated Loggregator via Isolation Segments
- Why not desirable - This is not currently possible in PAS. However, it is expected to become an available option in PAS v2.5
- Spin up more PCFs and move some of the AI traffic to the new foundations
- Why not desirable - The customer would incur additional cost for the overhead of an entirely new PCF foundation
Next Steps:
- Consider pushing traffic to another, less AI-heavy PCF foundation
- Leverage the (community-maintained) Firehose-to-Syslog nozzle for sending logs to an external provider such as ELK
- If sending logs to an external Splunk, then leverage the partner-supported Splunk tile