Aria Operations for Logs (Formerly vRealize Log Insight) upgrade might get stuck right after the primary node upgrade

Article ID: 318391

Products

VMware Aria Suite

Issue/Introduction

The Aria Operations for Logs upgrade fails after the primary node or one of the worker nodes reboots, because of a Cassandra sync issue between the primary and worker nodes.
Aria Operations for Logs upgrade concerns and insights

A time difference between the nodes in a cluster can lead to an inconsistent node status and, in some cases, a failure of the UI to load.

Running the date command shows a time drift of more than 2 seconds from the actual time and/or between the nodes.

In the /var/log/ntp log you may see "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" error messages.

In /storage/core/loginsight/var/runtime.log you may see the INFO-level message "[Could not access native clock (see debug logs for details), falling back to Java system clock]".
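
The two log symptoms above can be checked quickly from a shell on each node. The following is a minimal sketch, not an official tool: the helper name count_symptom is hypothetical, and it only counts lines containing a fixed string in a file.

```shell
#!/bin/bash
# Hypothetical helper: count the lines in a log file that contain a fixed string.
# Prints 0 when the file is missing, unreadable, or has no matching line.
count_symptom() {
    local pattern="$1" file="$2"
    if [ -r "$file" ]; then
        # -F treats the pattern as a literal string; -c prints the match count.
        grep -cF -- "$pattern" "$file" || true
    else
        echo 0
    fi
}

# Example calls, using the paths and messages quoted in this article:
# count_symptom "TIME_ERROR: 0x41: Clock Unsynchronized" /var/log/ntp
# count_symptom "Could not access native clock" /storage/core/loginsight/var/runtime.log
```

A non-zero count on any node suggests that the node's clock should be checked before attempting the upgrade.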

Environment

Aria Operations for Logs 8.x (VMware vRealize Log Insight 8.x)


Cause

The time skew between the nodes can be caused by improper NTP configuration.
Time skew can cause the distributed Cassandra database used by the loginsight service to report errors that prevent a rolling cluster upgrade from completing.

Resolution

1. Ensure NTP is configured and that the time difference between the nodes is no more than a few seconds (2 or 3 seconds).

To verify that NTP is configured with an NTP server, check that an external server is set in /etc/ntp.conf

Check the current time on the node: date 

To force a sync with the NTP server:
systemctl stop ntpd
ntpdate -u <ntp_server_IP_or_FQDN>
systemctl start ntpd
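
The checks in step 1 can be sketched as a few small shell helpers. This is an illustration under stated assumptions, not part of the product: the function names (ntp_servers, clock_skew, skew_ok) and the example hostname are hypothetical, and the default 2-second threshold is taken from the drift described in this article.

```shell
#!/bin/bash
# List the external time sources configured in an ntp.conf-style file.
# Entries starting with 127. (local reference clocks) are skipped.
ntp_servers() {
    awk '($1 == "server" || $1 == "pool") && $2 !~ /^127\./ { print $2 }' "${1:-/etc/ntp.conf}"
}

# Print the absolute difference, in seconds, between two epoch timestamps.
clock_skew() {
    local a="$1" b="$2" d
    d=$(( a - b ))
    echo "${d#-}"
}

# Succeed (exit status 0) when the skew is within the allowed number of seconds.
skew_ok() {
    local a="$1" b="$2" max="${3:-2}"
    [ "$(clock_skew "$a" "$b")" -le "$max" ]
}

# Example (the hostname is a placeholder; assumes SSH access to the other node):
# local_ts=$(date +%s)
# remote_ts=$(ssh root@li-worker-01.example.com date +%s)
# skew_ok "$local_ts" "$remote_ts" 2 || echo "Clock skew exceeds 2 seconds"
```

Because the SSH round trip itself takes time, a difference measured this way is only approximate; a value close to the threshold is worth rechecking.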

2. Make sure there are not too many Cassandra commitlog files or zero-byte hints files.
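
One way to check step 2 is to count the files directly. The sketch below makes several assumptions: the helper names are hypothetical, the directories are passed in as arguments because this article does not state where Cassandra keeps its commitlog and hints files, and the file name patterns (CommitLog-*.log and *.hints) follow standard Apache Cassandra naming.

```shell
#!/bin/bash
# Count regular files under a directory whose names match a glob.
count_files() {
    local dir="$1" glob="$2"
    find "$dir" -type f -name "$glob" 2>/dev/null | wc -l | tr -d ' '
}

# Count zero-byte regular files under a directory whose names match a glob.
count_empty_files() {
    local dir="$1" glob="$2"
    find "$dir" -type f -name "$glob" -size 0 2>/dev/null | wc -l | tr -d ' '
}

# Example (replace the placeholders with the directories used in your deployment):
# count_files       "<commitlog_directory>" 'CommitLog-*.log'
# count_empty_files "<hints_directory>"     '*.hints'
```

The article does not define a threshold for "too many", so the counts are best compared across nodes or against a healthy cluster.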

3. Shut down the loginsight service, then start only Cassandra on all the nodes to prepare to perform a Cassandra repair

      service loginsight stop
      /usr/lib/loginsight/application/sbin/li-cassandra.sh --startnow --force

4. Verify that nodetool-no-pass reports all the nodes as 'UN' (Up/Normal) by running

      nodetool-no-pass status
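
The status check can also be scripted rather than read by eye. The following is a minimal sketch that assumes the standard Cassandra nodetool status layout, in which each node line begins with a two-letter code (U or D for Up/Down, then N, L, J, or M for Normal/Leaving/Joining/Moving) followed by the node address. The function name all_nodes_un is hypothetical.

```shell
#!/bin/bash
# Read `nodetool status` output on stdin. Print the status code and address of
# every node that is not UN. Exit 0 only when at least one node is listed and
# all listed nodes are UN; exit 1 otherwise.
all_nodes_un() {
    awk '
        $1 ~ /^[UD][NLJM]$/ {
            total++
            if ($1 != "UN") { bad++; print $1, $2 }
        }
        END { exit (total > 0 && bad == 0) ? 0 : 1 }
    '
}

# Example:
# nodetool-no-pass status | all_nodes_un && echo "All nodes are UN"
```

Any line printed by the function identifies a node that needs attention before continuing.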

5. Perform the Cassandra repair on all the nodes by running the following commands on any one node. Once completed, ensure there are no errors at the end.

      nodetool-no-pass flush
      nodetool-no-pass repair --full
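
To make sure the repair is not started when the flush fails, the two commands can be run through a small wrapper that stops at the first non-zero exit status. This is a sketch only: the helper name run_in_order is hypothetical, and it assumes nodetool-no-pass is an executable on the PATH rather than a shell alias.

```shell
#!/bin/bash
# Run each argument as a command line, in order, and stop at the first failure.
run_in_order() {
    local cmd
    for cmd in "$@"; do
        echo ">>> $cmd"
        if ! sh -c "$cmd"; then
            echo "FAILED: $cmd" >&2
            return 1
        fi
    done
}

# Example:
# run_in_order "nodetool-no-pass flush" "nodetool-no-pass repair --full" \
#     && echo "Both commands exited with status 0"
```

A zero exit status does not replace reading the end of the repair output for errors, as the step above requires.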

6. Stop the standalone Cassandra service, then start the loginsight service (which also starts Cassandra automatically)

      /usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force
      service loginsight start

7. Verify that nodetool-no-pass reports all the nodes as 'UN' (Up/Normal) by running

      nodetool-no-pass status

8. Take a snapshot of all the nodes without memory (an offline snapshot is not required).
9. Start the upgrade from the UI. Once the primary node upgrade is done and the reboot has completed, connect via SSH and check the Cassandra status. It should report all the nodes as "UN".
10. The rolling upgrade should now proceed, with Cassandra in the "UN" state on all the nodes.

Additional Information

Impact/Risks:

The upgrade of Aria Operations for Logs (formerly vRealize Log Insight) is left partially complete, and it times out in the UI.