One or more VCF Operations Log Management service instances have been unavailable for more than 15 minutes. The Log Management availability alert is active and reports that one or more instances are down.
Log Management instance availability is affected by application-, node-, and VM-level issues. Log Management runs as a set of service instances inside a VCF Management Services cluster. The cluster provisions and manages the underlying node VMs on the customer's behalf: it decides how many node VMs are required, places service instances on them, restarts instances after a VM or process failure, and adds VMs when a component is scaled out. The cluster also monitors the health of the node OS and VMs, and replaces a node if its unhealthy condition persists.
Some unavailability is expected while lifecycle operations are in progress, including:
Aside from these planned operations, an outage that persists beyond 15 minutes is unplanned and indicates a probable underlying issue. The most common causes, and how to address them, are described below.
VCF Operations Log Management 9.1
VCF Management Services cluster hosting the Log Management component
Log Management instance availability is measured by a periodic call to the instance to determine if it is running. If it is not running, the instance is reported as unavailable. The availability metric does not assess whether a running instance is healthy. A running instance may not be healthy for the reasons listed below.
At the application level, log processors depend on a majority of the log store instances running. If a majority is not running, the log processors cannot process logs, respond to queries, or persist configuration changes. Another application-level cause is load. If the combined ingestion, query, alert, and rule load on the system is excessive, it can cause the instances to restart periodically. In addition, you can optionally mount NFS volumes on the Log Management instances. If instances are restarted while an NFS volume is inaccessible, they fail to restart.
At the node level, VCF Management Services nodes are VMs with a guest operating system. A node VM can be powered on with its guest OS running and still be unhealthy, in which case it cannot accept additional instances. Instances already running on an unhealthy node may continue to run or may be impacted.
At the VM level, the VMs run in a vSphere cluster, on a vSphere datastore, and communicate over the management network. The VMs are subject to the same infrastructure issues as any other vSphere VM, and standard vSphere troubleshooting applies. Common underlying causes:
Finally, the instances store their state in persistent volumes. An inaccessible volume would prevent an instance from restarting. At the vSphere layer, a persistent volume is a First Class Disk.
Step 1 — Confirm the outage is unplanned
In the VCF Operations UI, navigate to the build->lifecycle section and view the task tab. Verify that there are no active lifecycle tasks. If there are, wait for them to complete, then reassess whether the instances are still unavailable.
Step 2 — Evaluate application-level issues
Navigate back to the Log Management Health Summary dashboard. Review the availability section and determine whether a majority (more than half) of the log store instances are running. For example, if you have provisioned 5 instances, at least 3 must be running; if you have provisioned 6, at least 4 must be running.
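The majority rule in the examples above can be written as a short calculation. This is a minimal sketch for checking the numbers by hand; the function names are hypothetical and are not part of any VCF Operations API.

```python
def required_log_stores(provisioned: int) -> int:
    """Return the smallest instance count that is a strict majority.

    Hypothetical helper for illustration only.
    """
    if provisioned < 1:
        raise ValueError("provisioned must be at least 1")
    # More than half: integer-divide by two, then add one.
    return provisioned // 2 + 1


def has_majority(running: int, provisioned: int) -> bool:
    """True when enough log store instances are running."""
    return running >= required_log_stores(provisioned)


if __name__ == "__main__":
    for provisioned in (5, 6):
        print(provisioned, "provisioned ->", required_log_stores(provisioned), "required")
```

Note that with an even number of provisioned instances, exactly half running is not sufficient: 3 of 6 does not satisfy the rule.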
If a majority of the log store instances are running, check the availability time series. If the number of available instances changes frequently over time, some condition is causing the instances to restart. Excessive memory pressure can cause this effect. Check whether the memory pressure symptom is active, or was activated, at the same time as the changes in availability.
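If you export the availability time series, a simple way to judge whether the count is varying frequently is to count how often consecutive samples differ. The sketch below assumes the samples are a plain list of available-instance counts taken at regular intervals; the function names and the `max_changes` threshold are hypothetical and should be tuned to your sampling interval.

```python
from typing import Sequence


def count_changes(samples: Sequence[int]) -> int:
    """Count how many times the available-instance count changes
    between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples: Sequence[int], max_changes: int = 3) -> bool:
    """Flag the series when it changes more often than max_changes.

    max_changes is an illustrative threshold, not a product default.
    """
    return count_changes(samples) > max_changes


if __name__ == "__main__":
    stable = [5, 5, 5, 4, 4, 4]
    unstable = [5, 5, 4, 5, 3, 5, 5, 4]
    print("stable flapping:", is_flapping(stable))
    print("unstable flapping:", is_flapping(unstable))
```

A single step down that stays down, as in the first series, points to an instance that failed once and did not recover, while repeated changes point to instances restarting.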
If you have mounted NFS volumes on the Log Management instances to export, import, or archive logs, a volume may be inaccessible, which prevents instances from restarting. See KB 418135.
Step 3 – Infrastructure-level Issues
If a Log Management instance needs to be restarted, the VCF Management Services platform attempts to restart it on a healthy node. If there is no healthy node with capacity, the restart is blocked until an unhealthy node is replaced. See the Node Health section of KB 417255 for more troubleshooting steps.
Step 4 – Node VM Issues
The node list on the VCF Management Services Runtime VCF Health page lists the names of the nodes. The VM that backs each node has the same name as the node. The VCF Operations inventory includes an object that represents the node and an object that represents the node VM itself. Use this relationship as you explore whether VM-level issues are impacting the node VMs.
Check in VCF Operations for any VM-level alerts on the node VMs. Also check whether alerts have been raised on the ESX hosts on which these VMs are running. Address any critical alerts, as well as alerts that report CPU or memory contention, insufficient disk space, or networking issues.
In addition to checking for alerts, check whether the datastore on which the node VMs are running is impacted by an All Paths Down (APD) condition, and whether an ESX host in the cluster has failed or become isolated. Such an event could prevent node VMs from being restarted.
Check whether persistent volume issues are preventing one or more of the Log Management instances from starting. For this troubleshooting, first determine the persistent volumes for the impacted instances.
Note the resulting VM, ESX host, datastore, and persistent volume.
Navigate to VCF Operations > Configurations > Inventory Management to locate the Persistent Volume object and record its Identifier 2 value. Using this identifier as your primary correlation factor, switch to the vCenter UI and navigate to Cluster → Monitor → Cloud Native Storage → Container Volumes to find the matching Volume name. Once located, review the volume's Health Status and Performance, and if the environment is backed by vSAN, verify its Physical Placement to ensure storage compliance and reliability.