Troubleshooting event archive failures on the Log Management Health Dashboard

Article ID: 418136

Products

VCF Operations

Issue/Introduction

The Log Management Health Dashboard reports that event archive failures are greater than zero. This metric indicates that one or more archive operations did not finish successfully.

Log Management stores log messages into a set of special files called indexes. There is a separate set of indexes for each partition. Newly received log messages are written into an active index. The active index is periodically closed for new writes and a new one is created.

Archiving runs in the background after the active index is closed. A small metadata file is first written to the designated external archive location (NFS or on-premises S3-compatible object storage); the log store then creates a snapshot of the just-closed index in the archive repository for that log partition. Failures can happen if external storage or the log store is unhealthy, if the index is not ready for a snapshot, if the metadata file cannot be written, or if the snapshot step fails (including when a snapshot grows too large).

Environment

  • VMware Cloud Foundation Log Management 9.1
  • Log Management Health Dashboard
  • External event archive targets: NFS and on-premises S3-compatible object storage

Cause

Event archive failures can stem from one or more of the following:

External storage (NFS or S3-compatible object storage)

  • Metadata for the archive must be written to the path you configured before the snapshot is created. The system checks that the external storage is reachable before it writes the file.
  • NFS: Problems include an unavailable or read-only volume, a full volume or exceeded quota, permission errors, a stale file handle, or a server that is unreachable or timing out. In any of these cases, creating a folder or writing the file can fail.
  • S3-compatible object storage: Problems include errors from the on-premises object storage service, or client issues such as network, TLS, or credentials when uploading the metadata object. S3-compatible here means the API shape used by many on-premises object store products, not a specific public cloud.

Log store health

  • Before creating a snapshot, the system checks the index for unassigned shards. If any shards are unassigned, the index is not fully healthy on the log store; the archive is then retried up to a set number of times. If the system cannot check shard state because of an error, it may also retry the archive.
  • Wider log store issues (for example, nodes offline, cluster health not green, or internal errors when reading index information) can block building metadata or completing the snapshot.
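
The shard check and bounded retry described above can be sketched as follows. This is an illustration only, not the product's implementation: the function names, the `max_retries` default, and the choice to treat a failed snapshot as final are all assumptions, and the sketch omits any wait between attempts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ArchiveResult:
    succeeded: bool
    attempts: int
    reason: str


def archive_with_retries(
    count_unassigned_shards: Callable[[], int],  # hypothetical shard-state check
    take_snapshot: Callable[[], None],           # hypothetical snapshot step
    max_retries: int = 3,                        # assumed limit; the real value is not documented here
) -> ArchiveResult:
    """Retry the archive while shards are unassigned or the shard check errors."""
    reason = "not attempted"
    # One initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        try:
            unassigned = count_unassigned_shards()
        except Exception as exc:
            # Shard state could not be read: retry.
            reason = f"shard check error: {exc}"
            continue
        if unassigned > 0:
            # Index is not fully healthy yet: retry.
            reason = f"{unassigned} unassigned shard(s)"
            continue
        try:
            take_snapshot()
        except Exception as exc:
            # Assumption: a snapshot error is reported as a failure, not retried.
            return ArchiveResult(False, attempt, f"snapshot failed: {exc}")
        return ArchiveResult(True, attempt, "ok")
    return ArchiveResult(False, max_retries + 1, reason)
```

If the shards become assigned before the retries run out, the archive completes on a later attempt; otherwise the last observed reason is returned with the failure.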

Resolution

On the Log Management Health Dashboard, confirm the alert, then use the logs for your deployment to tie failures to a time range, log partition, or index name, as described under “Logs to look for” below.

  • Confirm archiving is enabled for the log partition and that the correct external storage and base path are selected. Use external storage validation or test features in the UI if available.
  • If logs point to background processing or task scheduling, check overall platform health for the logging services.
  • NFS: Confirm the share is mounted and writable:
    • In the Ops UI, open the External storage page, select the storage object used by the partition’s archive configuration, and run Test / Validate connection. Fix any reported failure.
    • Read the failed archive task error message in the log processor logs (see “Logs to look for” below). Messages often name the path and typically call out permission denied, stale file handle, read-only file system, no space left on device, or connection or timeout errors. Use these to decide whether the problem is mount, permissions, capacity, or network.
    • Confirm on the NFS server that the export exists, the client network can reach the server, and the export options allow read-write for the cluster.
  • S3-compatible object storage:
    • In the Ops UI, open the External storage page, select the storage object used by the partition’s archive configuration, and run Test / Validate connection. Fix any reported failure.
    • Confirm the service endpoint, credentials, bucket and object prefix, TLS, and network access to your on-premises object store. Use your vendor’s management console or health tools as needed. Resolve any API or credential errors that appear in the logs for the metadata upload path.
  • Log store: Return the cluster to good health (green or yellow). Fix unassigned shards on affected indexes so that all shard copies are assigned before the automatic archive retries run out. Address node outages, disk pressure, and related cluster issues using KB 418135.
  • After you fix the underlying issue, keep watching the dashboard until event archive failures return to zero. Some transient errors clear on their own after automatic retries.
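
For the NFS checks above, a small write probe can help separate mount, permission, capacity, and network problems. The sketch below is not part of the product. It assumes you can run Python on a host where the share is mounted, and the grouping of error codes into the four categories is this sketch's own reading of the messages listed above.

```python
import errno
import os
import tempfile

# Error-code names grouped into the four categories named above.
# The grouping is an assumption made for this sketch.
_CATEGORIES = {
    "permissions": ("EACCES", "EPERM"),
    "mount": ("ESTALE", "EROFS", "ENOENT"),
    "capacity": ("ENOSPC", "EDQUOT"),
    "network": ("ETIMEDOUT", "ECONNREFUSED", "EHOSTDOWN",
                "ENETUNREACH", "EHOSTUNREACH"),
}
# hasattr() skips constants that a given platform does not define.
_ERRNO_TO_CATEGORY = {
    getattr(errno, name): category
    for category, names in _CATEGORIES.items()
    for name in names
    if hasattr(errno, name)
}


def classify_os_error(exc: OSError) -> str:
    """Return 'permissions', 'mount', 'capacity', 'network', or 'other'."""
    return _ERRNO_TO_CATEGORY.get(exc.errno, "other")


def probe_writable(path: str) -> str:
    """Create, sync, and remove a small temp file under path.

    Returns 'ok' on success, otherwise the category of the OS error.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=path, prefix=".archive-probe-") as handle:
            handle.write(b"probe")
            handle.flush()
            os.fsync(handle.fileno())  # push the write through to the server
    except OSError as exc:
        return classify_os_error(exc)
    return "ok"
```

Call `probe_writable()` with the local mount point of the archive share. A result other than `ok` narrows down where to look, but it does not replace the Test / Validate connection check in the UI.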

Logs to look for (log-processor)

Normal progress (informational)

  • Lines that include “Executing log archive task”, “Starting archive for task, with index:”, “Starting archive for index”, “Completed archive for index”, “Successfully completed archive for task”, or “Skipping archive for index ... no data present”.

Shard readiness and retries (often transient)

  • “Unassigned shards present for index”, or text that says the “archive task will be retried”, or “Task ... will be retried. Retry count:”

Hard failures

  • “Archive failed for task”, or task execution errors that mention the archive task id and index together.

External storage and metadata upload

  • S3-compatible: “Failed to upload archive metadata file” along with bucket name, object path, or service-side upload errors next to those lines.
  • NFS: “Failed to write file”, “Failed to create directory” or other messages like insufficient disk space, disk quota exceeded, Permission denied, stale file handle or stale NFS, read-only file system, connection refused, host is down, network is unreachable, timed out, or wording that the NFS server is unreachable.
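
To triage a large log quickly, the phrases listed above can be turned into a simple classifier. This is a minimal sketch: the category names are made up for illustration, and it assumes each log entry sits on one line and contains the quoted phrase verbatim.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Phrases copied from the lists above. The first match wins, so failure
# patterns are checked before informational ones.
PATTERNS = [
    ("storage", re.compile(
        r"Failed to upload archive metadata file"
        r"|Failed to write file"
        r"|Failed to create directory")),
    ("hard_failure", re.compile(r"Archive failed for task")),
    ("retry", re.compile(
        r"Unassigned shards present for index|will be retried")),
    ("normal", re.compile(
        r"Executing log archive task"
        r"|Starting archive for (?:task|index)"
        r"|Completed archive for index"
        r"|Successfully completed archive for task"
        r"|Skipping archive for index")),
]


def classify_line(line: str) -> Optional[str]:
    """Return the category of an archive-related log line, or None."""
    for category, pattern in PATTERNS:
        if pattern.search(line):
            return category
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count archive-related lines per category, ignoring unrelated lines."""
    counts: Counter = Counter()
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] += 1
    return counts
```

Passing an open log file to `summarize()` gives a per-category count. A nonzero `hard_failure` or `storage` count marks the time ranges worth reading in full.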

Additional Information

Archive operations may be retried automatically for certain short-lived conditions (for example, unassigned shards on the index). Retries are limited; ongoing failures need investigation.