Aria Operations for Logs upgrade will get stuck right after the primary node upgrade

Article ID: 318391

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

The Aria Operations for Logs upgrade fails after the primary node or one of the worker nodes reboots, because of a Cassandra sync issue between the primary and worker nodes.
Aria Operations for Logs upgrade concerns and insights

A time-of-day difference between the nodes in a cluster can leave nodes in an inconsistent state and can sometimes prevent the UI from loading.

Running the date command on each node shows a time drift of more than 2 seconds, either against the actual clock time or between nodes.

In the /var/log/ntp log you may see "kernel reports TIME_ERROR: 0x41: Clock Unsynchronized" error messages.

In /storage/core/loginsight/var/runtime.log you may see the INFO-level message "[Could not access native clock (see debug logs for details), falling back to Java system clock]".

Environment

Aria Operations for Logs 8.16 and later

Cause

The time skew between the nodes can be caused by improper NTP configuration.
Time skew can cause the distributed Cassandra database used by the loginsight service to report errors that prevent a rolling cluster upgrade from completing.

Resolution

1. Ensure NTP is configured and that the time difference between the nodes is no more than a few seconds (2 or 3 seconds).

To confirm that the node is configured with an NTP server, check that an external server is set in:

/etc/ntp.conf

Check the current time on the node: date

To force a sync with the NTP server:

systemctl stop ntpd
ntpdate -u <ntp_server_IP_or_FQDN>
systemctl start ntpd
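
The drift check in step 1 can be scripted so that every node is compared in one pass. The following is a minimal sketch, not a supported tool: the function names, the 2-second threshold variable, and the use of root SSH access to each node are assumptions made for illustration. SSH round-trip latency adds a small error to each reading, so treat results near the threshold with caution.

```shell
#!/bin/sh
# Hypothetical helper: compare each node's clock with the local clock.
# Host names passed to check_drift are placeholders for your own nodes.

MAX_DRIFT=2   # seconds; the drift threshold mentioned in this article

# Print the absolute difference between two epoch timestamps.
abs_diff() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# Query each node over SSH and report its drift relative to this host.
check_drift() {
    for node in "$@"; do
        remote=$(ssh "root@$node" date +%s) || { echo "$node: unreachable"; continue; }
        local_now=$(date +%s)
        drift=$(abs_diff "$local_now" "$remote")
        if [ "$drift" -gt "$MAX_DRIFT" ]; then
            echo "$node: drift ${drift}s exceeds ${MAX_DRIFT}s"
        else
            echo "$node: drift ${drift}s OK"
        fi
    done
}
```

Example usage, with placeholder host names: check_drift li-node1.example.com li-node2.example.com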

2. Make sure there are not too many 'cassandra commitlog' files or zero-byte hints files.

Reference Link to remove Hints File: Error "Failed to dispatch hints file" "file is corrupted" in cassandra.log
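
As a rough way to perform the check in step 2, the sketch below lists zero-byte hints files and counts files in a directory. The default paths are assumptions about where the appliance stores Cassandra data and should be verified on your nodes; the function names are hypothetical. The script only reports files. Follow the referenced article before removing anything.

```shell
#!/bin/sh
# Assumed default locations; verify these paths on your appliance.
HINTS_DIR="${HINTS_DIR:-/storage/core/loginsight/cidata/cassandra/hints}"
COMMITLOG_DIR="${COMMITLOG_DIR:-/storage/core/loginsight/cidata/cassandra/commitlog}"

# List zero-byte *.hints files under the given directory, one per line.
list_empty_hints() {
    find "$1" -type f -name '*.hints' -size 0 2>/dev/null
}

# Print the number of regular files under the given directory.
count_files() {
    find "$1" -type f 2>/dev/null | wc -l | tr -d ' '
}
```

Example usage: list_empty_hints "$HINTS_DIR" prints any empty hints files, and count_files "$COMMITLOG_DIR" prints how many commit log files are present.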

3. Shut down the loginsight service, then start only Cassandra on all the nodes to prepare for a Cassandra repair:

service loginsight stop
/usr/lib/loginsight/application/sbin/li-cassandra.sh --startnow --force

4. Verify that nodetool-no-pass reports all the nodes as 'UN' (Up/Normal) by running:

nodetool-no-pass status
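
The 'UN' check can also be done mechanically instead of by eye. The sketch below assumes the standard nodetool status layout, in which each node line begins with a two-letter status/state code such as UN or DN. The function name all_nodes_un is hypothetical. It reads the status output on standard input, prints any node line that is not UN, and succeeds only if at least one node is listed and every node is UN.

```shell
#!/bin/sh
# Read `nodetool status`-style output on stdin.
# Exit status 0 only if at least one node line exists and all are UN.
all_nodes_un() {
    awk '
        # Node lines start with status (U/D) and state (N/L/J/M).
        /^[UD][NLJM][[:space:]]/ {
            total++
            if ($1 != "UN") { bad++; print "not UN: " $0 }
        }
        END {
            if (total > 0 && bad == 0) exit 0
            exit 1
        }
    '
}
```

Example usage: nodetool-no-pass status | all_nodes_un && echo "all nodes UN"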

5. Perform the Cassandra repair for all the nodes by running the following commands on any one node. Once the repair completes, confirm that the output ends without errors.

nodetool-no-pass flush
nodetool-no-pass repair --full

6. Stop only the Cassandra service, then start the loginsight service (which also starts Cassandra automatically):

/usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force
service loginsight start

7. Verify again that nodetool-no-pass reports all the nodes as 'UN' by running:

nodetool-no-pass status

8. Take a snapshot of all the nodes without memory (an offline snapshot is not required).
Reference Link: How to take a Snapshot of VMware Aria Operations for Logs

9. Start the upgrade from the UI. Once the primary (master) node upgrade is done and its reboot has completed, connect via SSH and check the Cassandra status. It should report all the nodes as "UN".

10. The rolling upgrade should now proceed, with Cassandra in the "UN" state on all the nodes.

Additional Information

Impact/Risks:
The upgrade of Aria Operations for Logs (formerly vRealize Log Insight) completes only partially, and it times out in the UI.

Note:
The same issue can be observed during certificate replacement, because that operation requires the Cassandra service to be up.
