Time sync/drift issues between Aria Operations for Logs nodes cause multiple issues



Article ID: 387888


Updated On:

Products

VMware Aria Suite

Issue/Introduction

If any of the following symptoms are seen, use the steps in the Resolution section to verify NTP sync status on all nodes in the Aria Operations for Logs cluster.

  • Entries like the following are seen in the /storage/core/loginsight/var/cassandra.log file:
    [HintsDispatcher:1] 2025-01-21T00:00:00,000 HintsDispatchExecutor.java:294 - Finished hinted handoff of file hintfile_uuid.hints to endpoint /##.##.##.##:7000: hintfile_uuid
  • Logging in to the UI as the local admin user results in "Error authenticating user"

  • API call results in HTTP 500 Internal Server Error

  • After custom certificates are successfully updated using Lifecycle Manager, a node is found to be offline, which causes the cluster to be offline

    • Reviewing the /storage/core/loginsight/var/runtime.log file, you may see entries related to the Cassandra database failing to start and to the native clock:
      [2025-07-08 20:38:32.100+0000] ["main"/##.##.##.## INFO] [com.vmware.loginsight.daemon.LogInsightDaemon] [Exception during start cassandra database]
      com.vmware.loginsight.daemon.LogInsightDaemon$StartupFailedException: Daemon startup failed: Failed to start Cassandra Server: StartupException(description:Unable to connect to Cassandra node at 0.0.0.0:9042: com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=##.##.##.##:9042, hostId=null, hashCode=########): [com.datastax.oss.driver.api.core.DriverTimeoutException: [s78|control|id: 0x5bff502c, L:/##.##.##.##:54368 - R:/##.##.##.##:9042] Protocol initialization request, step 3 (AUTH_RESPONSE): timed out after 5000 ms]).


      [2025-07-08 20:39:48.590+0000] ["s0-admin-1"/##.##.##.## WARN] [com.datastax.oss.driver.internal.core.control.ControlConnection] [[s0] Error connecting to Node(endPoint=##.##.##.##:9042, hostId=null, hashCode=########), trying next node (AnnotatedConnectException: Connection refused: /##.##.##.##:9042)]
      [2025-07-08 20:39:50.618+0000] ["main"/##.##.##.## INFO] [com.vmware.loginsight.cassandra.CassandraServerController] [No cassandra hosts available after 3721 ms wait]

      [2025-07-08 20:39:48.342+0000] ["s0-admin-0"/##.##.##.## INFO] [com.datastax.oss.driver.internal.core.time.Clock] [Could not access native clock (see debug logs for details), falling back to Java system clock]

       
    • Running the date command on all cluster nodes shows a time difference of a few minutes between one or more nodes

Environment

VMware Aria Operations for Logs 8.x

Cause

The internal Cassandra database is sensitive to time drift between nodes in the cluster. Any drift over 2 seconds should be resolved to allow database operations to succeed.
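The 2-second threshold can be checked by comparing epoch timestamps gathered from each node. The sketch below is illustrative only: drift_ok is a hypothetical helper (not part of the product), and the commented ssh lines assume root SSH access and placeholder node names.

```shell
# drift_ok SECS_A SECS_B: succeed when two epoch timestamps are within
# 2 seconds of each other (the threshold described in the Cause section).
# Hypothetical helper for illustration only.
drift_ok() {
  a=$1
  b=$2
  diff=$(( a - b ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  [ "$diff" -le 2 ]
}

# Usage sketch (node names are placeholders):
#   t1=$(ssh root@node1 date -u +%s)
#   t2=$(ssh root@node2 date -u +%s)
#   drift_ok "$t1" "$t2" || echo "Drift over 2 seconds between node1 and node2"
```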

Resolution

Verify that the appliance VMs can communicate and sync with the configured NTP servers

  1. Log in to the Aria Operations for Logs appliance as root via SSH or vSphere Console

  2. Query the time sync status
    ntpq -p
  3. Verify that the reach value is 377

    Note: reach is an octal counter that indicates the status of the last 8 attempts to contact the configured NTP server.

    0 indicates a failed contact
    1 indicates a successful contact

    377 = 11111111 in binary (meaning all of the last 8 contacts were successful)

    The reach count will restart each time the VM or the ntpd service is restarted. Verify that sufficient time has passed since the last restart for 8 contact attempts when checking.

    If reach is not 377, network troubleshooting steps such as ping and telnet should be used to verify network connectivity with the configured NTP servers. Review firewalls (external to the appliance VM) to verify that NTP traffic is allowed between all appliance VMs and the configured NTP servers.
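The reach check above can be scripted against the ntpq output. The helper below is a sketch, not a supported tool: check_reach reads `ntpq -p` output on stdin, skips the two header lines, and prints any peer whose reach column (the 7th field in the standard peer billboard) is not 377.

```shell
# check_reach: read `ntpq -p` output on stdin and print every peer whose
# reach value is not 377 (i.e. at least one of the last 8 polls failed).
# Sketch only; assumes the standard ntpq peer-billboard column layout.
check_reach() {
  awk 'NR > 2 { if ($7 != "377") print $1 " reach=" $7 }'
}

# Usage sketch:
#   ntpq -p | check_reach
```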

  4. Verify that the offset value is between -60000 and 60000

    Note: offset is the value in milliseconds that the time on the appliance differs from the NTP server

    If the value exceeds 60,000ms (60s) in either direction, manually sync the time with the NTP server

    1. Stop the ntpd service

      systemctl stop ntpd


    2. Sync the time with the preferred NTP server

      ntpdate ntp_server_ip_or_fqdn

      Note: Replace ntp_server_ip_or_fqdn with the IP or FQDN of the preferred NTP server. Use the same NTP server for all nodes.

    3. Start the ntpd service

      systemctl start ntpd


  5. Repeat steps 1-4 on all nodes in the cluster
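The offset check in step 4 can be scripted in the same way. check_offset is a hypothetical helper that reads `ntpq -p` output on stdin and flags any peer whose offset (the 9th field, in milliseconds) falls outside the -60000 to 60000 range; the column position assumes the standard peer billboard.

```shell
# check_offset: read `ntpq -p` output on stdin and print every peer whose
# offset (9th column, in milliseconds) exceeds 60000 ms in either direction.
# Sketch only; assumes the standard ntpq peer-billboard column layout.
check_offset() {
  awk 'NR > 2 { o = $9; if (o < 0) o = -o; if (o > 60000) print $1 " offset=" $9 "ms" }'
}

# Usage sketch:
#   ntpq -p | check_offset
```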

Additional Information

To update the NTP settings, reference: Synchronize the Time on the VMware Aria Operations for Logs Virtual Appliance

In environments where NTP is unavailable or unreliable, use the ESXi hosts as the source of time for the appliance VMs. Verify the ESXi hosts all utilize the same time providers and are in sync with each other.

In environments where there is drift between multiple configured NTP servers, use the same single NTP provider for all appliance VMs in the cluster.