Ingestion Troubleshooting

Products

VCF Operations

Issue/Introduction

This guide helps administrators self-diagnose and resolve common ingestion issues in Operations for Logs. If you are experiencing log delivery failures, unexpected gaps in log streams, high-latency ingest pipelines that cause queue build-up, or unexpected data loss, this document walks you through the key concepts, diagnostic steps, and mitigation options.

Operations for Logs event ingestion is a pipeline with three stages:

Gateway
Log Processor
Log Store

Event sources connect to the gateway. Event messages are processed by the log processor and persisted by the log store.

Issues

> Total Rejected Connections > 5% Over Last Six Hours

Environment

VCF 9.1

Cause

Connections can be dropped when too many clients send log messages to the same FQDN/VIP and exceed its connection limit, or when clients do not use SSL while the system is configured to require it.

Dropped connections may also occur if the system is queuing messages and the queue cannot process all the accepted messages.
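As a rough illustration of the alert condition, the sketch below computes the share of rejected connections over a six-hour window from per-interval counters. The sample format (`accepted` and `rejected` per interval), the function name, and the choice of total connection attempts as the denominator are assumptions made for illustration. They do not describe an Operations for Logs API or how the product computes the alert.

```python
from dataclasses import dataclass


@dataclass
class ConnectionSample:
    """One hypothetical sampling interval of gateway connection counters."""
    accepted: int
    rejected: int


def rejected_percentage(samples):
    """Return rejected connections as a percentage of all connection attempts."""
    total_rejected = sum(s.rejected for s in samples)
    total_attempts = sum(s.accepted + s.rejected for s in samples)
    if total_attempts == 0:
        return 0.0  # no traffic in the window, so nothing was rejected
    return 100.0 * total_rejected / total_attempts


# Six hypothetical hourly samples covering the last six hours.
window = [
    ConnectionSample(accepted=980, rejected=20),
    ConnectionSample(accepted=950, rejected=50),
    ConnectionSample(accepted=900, rejected=100),
    ConnectionSample(accepted=850, rejected=150),
    ConnectionSample(accepted=900, rejected=100),
    ConnectionSample(accepted=920, rejected=80),
]
pct = rejected_percentage(window)
alert = pct > 5.0  # threshold taken from the alert name
```

Aggregating over the whole window, rather than averaging per-interval percentages, keeps low-traffic intervals from skewing the result.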

Resolution

Check how the connection load is distributed across the FQDNs/VIPs. If you are exceeding the per-gateway connection budget, add FQDNs/VIPs from the Log Management Configuration page to distribute the connection load. Alternatively, the system may be under-provisioned for the number of sources connecting and the volume of logs they send. In that case, see the next issue, “Ingest queue used capacity averaged > 10% over last six hours”.
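The load check can be approximated offline if you keep your own inventory of which source targets which FQDN/VIP. The sketch below is a minimal example under that assumption. The host names, the inventory format, and the budget value are placeholders, not documented limits or product exports.

```python
from collections import Counter


def connections_per_target(source_targets):
    """Count how many sources point at each FQDN/VIP."""
    return Counter(target for _source, target in source_targets)


def over_budget(counts, budget):
    """Return the targets whose source count exceeds the budget, sorted by name."""
    return sorted(target for target, n in counts.items() if n > budget)


# Hypothetical inventory of (source name, target FQDN/VIP) pairs.
inventory = [
    ("esx-01", "logs-a.example.com"),
    ("esx-02", "logs-a.example.com"),
    ("esx-03", "logs-a.example.com"),
    ("esx-04", "logs-b.example.com"),
]
counts = connections_per_target(inventory)
hot_targets = over_budget(counts, budget=2)  # budget is a placeholder value
```

Targets listed in `hot_targets` are candidates for moving some sources to an additional FQDN/VIP.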

> Ingest queue used capacity averaged > 10% over last six hours

Cause

Events are queued when the incoming rate exceeds the rate at which they can be processed, or when availability issues block persisting events to the Log Store. Queuing can also be caused by environmental issues, such as slow disks or an under-performing host running the Log Store application. These conditions create back pressure, which causes the system to queue messages.

When the queue is full, events are dropped. Queued events are processed gradually, consuming some pipeline capacity until the queue is empty.
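The queue behavior described here can be illustrated with a toy model. The following sketch simulates a bounded queue in discrete ticks. It is a simplified illustration, not the product's actual queuing implementation, and the rates and queue limit are made-up numbers.

```python
def simulate_queue(incoming, capacity_per_tick, queue_limit):
    """Simulate a bounded ingest queue.

    incoming: number of events arriving in each tick.
    capacity_per_tick: events the pipeline can process per tick.
    queue_limit: maximum number of events the queue can hold.
    Returns (queue depth after each tick, total dropped events).
    """
    queued = 0
    dropped = 0
    depths = []
    for arrived in incoming:
        backlog = queued + arrived
        # Process as much as capacity allows, oldest backlog included.
        backlog -= min(backlog, capacity_per_tick)
        # Anything that does not fit in the queue is dropped.
        if backlog > queue_limit:
            dropped += backlog - queue_limit
            backlog = queue_limit
        queued = backlog
        depths.append(queued)
    return depths, dropped


# A three-tick spike of 150 events/tick against a capacity of 100 events/tick.
depths, dropped = simulate_queue(
    incoming=[100, 150, 150, 150, 50, 50, 50],
    capacity_per_tick=100,
    queue_limit=80,
)
```

In this run the queue fills during the spike, 70 events are dropped while it is full, and the backlog drains over the two ticks after the load falls below capacity.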

Resolution

Check the Accepted Events Ingestion History for large spikes or a sustained increase over the last six hours. A sudden spike can cause some queue build-up, but the backlog is usually processed once the load drops.
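If you export the ingestion history as a series of rates, the distinction between a spike and a sustained increase can be sketched as follows. The function name, the 1.5x factor, and the 50% fraction are arbitrary illustrative choices, not product thresholds.

```python
def classify_ingest_history(rates, baseline, factor=1.5, sustained_fraction=0.5):
    """Classify a series of accepted-event rates against a baseline rate.

    Returns "sustained" if at least sustained_fraction of the samples exceed
    factor * baseline, "spike" if only some samples do, and "normal" otherwise.
    """
    threshold = factor * baseline
    high = [r for r in rates if r > threshold]
    if not high:
        return "normal"
    if len(high) / len(rates) >= sustained_fraction:
        return "sustained"
    return "spike"


# Hypothetical hourly rates (events/second) with a baseline of 100.
one_spike = classify_ingest_history([100, 100, 400, 100, 100, 100], baseline=100)
growth = classify_ingest_history([100, 200, 220, 210, 230, 240], baseline=100)
```

A "spike" result suggests the queue will likely drain without intervention, while "sustained" points toward reducing volume or adding capacity.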

Next, check Log Store availability. Review the availability alert for any availability issues.

Otherwise, reduce the log volume, reduce the processing, add Log Management instances, or scale up to a larger Log Management component size. Ways to reduce processing include using fewer masking or forwarding rules, fewer partitions, fewer indexed fields, and fewer alerts. Also remediate any issues that block log-service instances from restarting, and ensure that the management service node VMs are not experiencing CPU or memory contention and that VM storage read/write latency is within the expected range.

> Dropped Events >5% Over The Last Six Hours

This issue is closely related to “Ingest queue used capacity averaged > 10% over last six hours”.

Cause

Events are dropped when the incoming rate exceeds the rate at which events can be processed and the maximum queue limit has been reached. Events are also dropped when availability issues block persisting events to the Log Store.

Resolution

Review the availability alert for any availability issues. Otherwise, reduce the log volume, reduce the processing, add Log Management instances, or scale up to a larger Log Management component size. Ways to reduce processing include using fewer masking or forwarding rules, fewer partitions, fewer indexed fields, and fewer alerts. Also remediate any issues that block log-service instances from restarting, and ensure that the management service node VMs are not experiencing CPU or memory contention and that VM storage read/write latency is within the expected range.

> Average Write Response Time Above Dynamic Threshold

Cause

Average response time measures how long it takes the log store to process the ingest requests sent to it. The higher the response time, the lower the overall throughput.
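The relationship between response time and throughput can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a fixed number of write requests in flight and a fixed batch size per request. Both are simplifying assumptions for illustration and the numbers are invented.

```python
def max_throughput(concurrent_requests, events_per_request, avg_response_seconds):
    """Upper bound on events/second for a fixed number of in-flight write requests."""
    if avg_response_seconds <= 0:
        raise ValueError("avg_response_seconds must be positive")
    return concurrent_requests * events_per_request / avg_response_seconds


# Same concurrency and batch size, but the response time doubles.
fast = max_throughput(concurrent_requests=8, events_per_request=500,
                      avg_response_seconds=0.25)
slow = max_throughput(concurrent_requests=8, events_per_request=500,
                      avg_response_seconds=0.5)
```

Under these assumptions, doubling the average response time halves the achievable throughput (16000 versus 8000 events per second in this example).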

Resolution

The response time depends on the load on the service and the resources allocated to the management services runtime. Review whether your load exceeds your capacity, and scale out or up as necessary. Creating too many partitions puts extra pressure on the system, so review whether all partitions are required and whether some fields need not be indexed. Also ensure that the management service node VMs are not experiencing CPU or memory contention and that VM storage read/write latency is within the expected range.

Related Situations Not Covered by the Ingestion Alert Symptoms Above

> Events are Missing

Events can go missing for any of the reasons described above. In these cases, client sources see timeouts or errors because they cannot send messages to the configured FQDN/VIP. Messages can also be dropped when service instances are offline.

> Events delivered Late

When the system is under sustained load for a prolonged period, events are queued but cannot be processed quickly because of the pressure from incoming messages. As a result, some messages arrive late at the Log Store.
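How late events arrive depends on how much spare capacity remains while the backlog drains. A rough estimate, assuming constant processing capacity and a constant incoming rate (both idealizations, with made-up numbers), is sketched below.

```python
import math


def drain_seconds(backlog_events, capacity_per_second, incoming_per_second):
    """Estimate seconds until the backlog clears; math.inf if it never drains."""
    spare = capacity_per_second - incoming_per_second
    if spare <= 0:
        return math.inf  # incoming load uses all capacity, so the backlog persists
    return backlog_events / spare


# 600,000 queued events with 1,000 events/second of spare capacity.
recovering = drain_seconds(600_000, capacity_per_second=10_000,
                           incoming_per_second=9_000)
# No spare capacity: the backlog cannot drain.
saturated = drain_seconds(600_000, capacity_per_second=10_000,
                          incoming_per_second=10_000)
```

In the first case the backlog clears in about 600 seconds. In the second it never clears until the incoming rate drops or capacity is added.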

> Clients/Agents reporting timeouts or 429/rejected

Any of the conditions above can cause clients/agents to see timeouts and rejections. Review your Log Collection Configuration to check whether all sources target the same FQDN/VIP. If they do, distribute the load by adding more VIPs.

Additional Information

Ingestion issues can be caused by insufficient resources for the VCF Management Services runtime VMs. See KB 424379 for related troubleshooting.