VCF Operations Log Management service instances remain offline for longer than expected

Products

VCF Operations

Issue/Introduction

One or more VCF Operations Log Management service instances have been unavailable for more than 15 minutes. The Log Management availability alert is active and reports that one or more instances are down.

Log Management instance availability is impacted by application-level, node-level, and VM-level issues. Log Management runs as a set of service instances inside a VCF Management Services cluster. The cluster provisions and manages the underlying node VMs on the customer's behalf: it decides how many node VMs are required, places service instances on them, restarts instances after a VM or process failure, and adds VMs when a component is scaled out. The cluster also monitors the health of the node OS and VMs, and replaces a node if its unhealthy condition persists.

Some unavailability is expected while lifecycle operations are in progress, including:

Initial deployment of Log Management.
Adding Log Management instances or scaling up to a larger component size.
Scaling the VCF Management Services cluster itself to a larger size.
Patch or upgrade operations.
Rebalancing that occurs after another component is deployed, scaled, or removed.

Aside from these planned operations, an outage that persists beyond 15 minutes is unplanned and indicates a probable underlying issue. The most common causes, and how to address them, are described below.
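The distinction between expected and unplanned unavailability can be written as a small decision rule. The following is a minimal sketch, not product code: the function name, its parameters, and the returned labels are hypothetical, and only the 15-minute threshold and the lifecycle-operation exception come from this article.

```python
from datetime import datetime, timedelta

# Threshold taken from this article; everything else here is illustrative.
UNPLANNED_THRESHOLD = timedelta(minutes=15)


def classify_outage(down_since, now, lifecycle_task_active):
    """Classify an instance outage as expected, pending, or unplanned.

    down_since / now: datetime values for when the instance went down
    and the current time.
    lifecycle_task_active: True if a deployment, scale, patch, upgrade,
    or rebalance operation is in progress.
    """
    if lifecycle_task_active:
        # Unavailability during lifecycle operations is expected.
        return "expected"
    if now - down_since > UNPLANNED_THRESHOLD:
        # No planned operation and past the threshold: investigate.
        return "unplanned"
    # Not yet past the threshold; keep watching.
    return "pending"
```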

Environment

VCF Operations Log Management 9.1

VCF Management Services cluster hosting the Log Management component

Cause

Log Management instance availability is measured by a periodic call to the instance to determine if it is running. If it is not running, the instance is reported as unavailable. The availability metric does not assess whether a running instance is healthy. A running instance may not be healthy for the reasons listed below.

At the application level, log processors depend on a majority of the log store instances running. Without a running majority, the log processors cannot process logs, respond to queries, or persist configuration changes. Another application-level cause is load: if the ingestion, query, alert, or rule load on the system is excessive, the instances can restart periodically. In addition, you can optionally mount NFS volumes on the Log Management instances. If an instance is restarted while an NFS volume is inaccessible, the instance will not start.

At the node level, VCF Management Services nodes are VMs with a guest operating system. A node VM can be powered on with its guest OS running and still be unhealthy, in which case it cannot accept additional instances. Instances already running on the unhealthy node may continue to run or may be impacted.

At the VM level, the VMs run in a vSphere cluster, on a vSphere datastore, and communicate over the management network. The VMs are subject to the same infrastructure issues as any other vSphere VM, and standard vSphere troubleshooting applies. Common underlying causes:

Datastore All Paths Down (APD) condition: node VM I/O is delayed, including I/O issued by instances running on the VM, or the node VM is powered off and cannot be restarted.
ESXi host failure or isolation: host down, network-isolated, or unexpectedly in maintenance mode; vSphere HA cannot or did not restart the affected node VMs, reducing Management Services capacity.
Cluster CPU/memory contention: node VMs cannot power on or be migrated by DRS.
DRS disabled or migration blocked: node VMs stranded on a failed or saturated host.

Finally, the instances store their state in persistent volumes. An inaccessible volume would prevent an instance from restarting. At the vSphere layer, a persistent volume is a First Class Disk.

Resolution

Step 1 — Confirm the outage is unplanned

In the VCF Operations UI, navigate to the build->lifecycle section and view the task tab. Verify that there are no active lifecycle tasks. If there are, wait for them to complete, then reassess whether the instances are still unavailable.

Step 2 — Evaluate application-level issues

Navigate back to the Log Management Health Summary dashboard. Review the availability section and determine whether a majority (more than half) of the log store instances are running. For example, if you have provisioned 5 instances, at least 3 must be running; if you have provisioned 6, at least 4 must be running.
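The majority requirement is simple integer arithmetic. The sketch below is illustrative only; the function names are hypothetical and are not part of any VCF Operations API.

```python
def required_log_stores(provisioned):
    """Smallest number of running log store instances that is a majority."""
    return provisioned // 2 + 1


def has_majority(provisioned, running):
    """True if more than half of the provisioned instances are running."""
    return running >= required_log_stores(provisioned)


for n in (3, 5, 6):
    print(n, "provisioned ->", required_log_stores(n), "must be running")
```

For 5 and 6 provisioned instances this yields 3 and 4, matching the examples given in this step. Note that exactly half is not sufficient: with 6 provisioned and 3 running, has_majority(6, 3) is False.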

If a majority of the log store instances are running, check the availability time series. If the number of available instances changes frequently over time, some condition is causing the instances to restart. Excessive memory pressure can cause this effect. Check whether the memory pressure symptom is active, or was activated at the same time as the changes in availability.
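If the availability time series is exported as a list of samples, frequent restarts show up as many changes between consecutive values. The following is a minimal sketch under stated assumptions: the sample data is invented, the function names are hypothetical, and the max_changes threshold is an arbitrary illustrative value, not a product default.

```python
def count_availability_changes(samples):
    """Count how many times the available-instance count changes
    between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples, max_changes=3):
    """Flag a series as flapping when it changes more often than max_changes.

    max_changes is an arbitrary illustrative threshold.
    """
    return count_availability_changes(samples) > max_changes


# Invented sample data: available-instance count at successive intervals.
stable = [5, 5, 5, 5, 5, 5]
unstable = [5, 4, 5, 3, 5, 4, 5]
print(is_flapping(stable), is_flapping(unstable))
```

A series flagged this way is a prompt to compare the change times against the memory pressure symptom, not a diagnosis in itself.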

If you have mounted NFS volumes on the Log Management instances to export, import, or archive logs, a volume may be inaccessible, preventing instances from restarting. See KB 418135.

Step 3 – Infrastructure-level Issues

If a Log Management instance needs to be restarted, the VCF Management Services platform will attempt to restart it on a healthy node. If there is no healthy node with capacity, the restart will be blocked until an unhealthy node is replaced. See the Node Health section of KB 417255 for more troubleshooting steps.
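The restart behavior described here can be pictured as a simple placement check. This is a conceptual sketch only and does not represent the platform's actual scheduler: the Node type, the free_slots capacity unit, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool
    free_slots: int  # hypothetical unit of spare capacity


def pick_restart_node(nodes):
    """Return the name of a healthy node with spare capacity, or None
    if the restart would be blocked."""
    for node in nodes:
        if node.healthy and node.free_slots > 0:
            return node.name
    return None


# Invented example: the only node with capacity is unhealthy.
nodes = [
    Node("node-1", healthy=True, free_slots=0),
    Node("node-2", healthy=False, free_slots=2),
]
print(pick_restart_node(nodes))
```

In the example, the restart is blocked (the function returns None) because the healthy node is full and the node with spare capacity is unhealthy, which is the situation this step asks you to rule out.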

Step 4 – Node VM Issues

The node list on the VCF Management Services Runtime VCF Health page lists the names of the nodes. The VM that backs each node has the same name as the node. The VCF Operations inventory includes an object that represents the node and an object that represents the node VM itself. Use this relationship as you explore whether VM-level issues are impacting the node VMs.

Check in VCF Operations for any VM-level alerts on the node VMs. Also check for alerts raised on the ESX hosts on which these VMs are running. Address any critical alerts, as well as alerts reporting CPU or memory contention, out-of-disk-space conditions, or networking issues.

In addition to checking for alerts, check whether the datastore on which the node VMs are running is affected by an APD condition, or whether there is an ESX host failure or isolation in the cluster. Such an event could prevent node VMs from being restarted.

Check whether persistent volume issues are preventing one or more of the Log Management instances from starting. First, determine the persistent volumes for the impacted instances:

In the VCF Operations UI, open the Dashboards section.
If the VCF Log Management Drilldown dashboard is not visible, click Manage, search for logs, and activate the dashboard.
Open the dashboard and go to the Availability section.
Click through the listed log store and log processor instances and record the ones that are reported as unavailable.
Next, use the VCF Operations UI object search to locate the ops-logs object, open its inventory page, and select the Topology tab.
From the ops-logs object, follow this object traversal: ops-logs → {log store | log processor} → service instance → persistent volume → datastore

Note the resulting VM, ESX host, datastore, and persistent volume.
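The traversal is a walk down a parent-to-child relationship tree. The sketch below models it with an in-memory dictionary to show how each unavailable instance maps to its persistent volume and datastore. All object names in TOPOLOGY are invented, the functions are hypothetical, and nothing here calls a VCF Operations API; in practice the relationships are read from the Topology tab.

```python
# Hypothetical parent -> children relationships, mirroring the traversal
# ops-logs -> {log store | log processor} -> service instance
#          -> persistent volume -> datastore
TOPOLOGY = {
    "ops-logs": ["log-store", "log-processor"],
    "log-store": ["log-store-0", "log-store-1"],
    "log-processor": ["log-processor-0"],
    "log-store-0": ["pv-a"],
    "log-store-1": ["pv-b"],
    "log-processor-0": ["pv-c"],
    "pv-a": ["datastore-1"],
    "pv-b": ["datastore-1"],
    "pv-c": ["datastore-2"],
}


def paths_from(root, topology):
    """Yield every root-to-leaf path in the topology as a list of names."""
    children = topology.get(root, [])
    if not children:
        yield [root]
        return
    for child in children:
        for path in paths_from(child, topology):
            yield [root] + path


def storage_for(instances, topology, root="ops-logs"):
    """Map each listed instance to its (persistent volume, datastore)."""
    result = {}
    for path in paths_from(root, topology):
        # Path shape: root, component, instance, volume, datastore
        if len(path) == 5 and path[2] in instances:
            result[path[2]] = (path[3], path[4])
    return result


print(storage_for({"log-store-1", "log-processor-0"}, TOPOLOGY))
```

Instances that resolve to the same datastore (as log-store-0 and log-store-1 do in this invented data) point to a shared storage dependency worth checking first.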

Navigate to VCF Operations > Configurations > Inventory Management, locate the Persistent Volume object, and record its Identifier 2 value. Using this identifier to correlate the two views, switch to the vCenter UI and navigate to Cluster → Monitor → Cloud Native Storage → Container Volumes to find the matching Volume name. Once located, review the volume's Health Status and Performance. If the environment is backed by vSAN, also verify its Physical Placement to confirm storage compliance and reliability.