"FAILED: Unable to get user data. Possible Cassandra is down"

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Unable to log in to the web interface of Aria Operations for Logs. Login attempts fail with "Error authenticating user".
Checking the status of the admin password returns the following error: "FAILED: Unable to get user data. Possible Cassandra is down."
Running the li-reset-admin-passwd.sh script fails with the same error: "FAILED: Unable to get user data. Possible Cassandra is down."

The following event is seen in /var/log/loginsight/runtime.log:

java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded).

The log file /storage/core/loginsight/var/cassandra.log contains entries similar to the following:

INFO  [main] NativeTransportService.java:73 - Netty using Java NIO event loop
WARN  [main] NativeTransportService.java:166 - epoll not available
java.lang.UnsatisfiedLinkError: /tmp/libnetty_transport_native_epoll_x86_##################.so: /tmp/libnetty_transport_native_epoll_x86_##################.so: failed to map segment from shared object
        at java.lang.ClassLoader$NativeLibrary.load0(Native Method) ~[?:?]
        at java.lang.ClassLoader$NativeLibrary.load(Unknown Source) ~[?:?]
        at java.lang.ClassLoader$NativeLibrary.loadLibrary(Unknown Source) ~[?:?]
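
To check quickly whether a node's Cassandra log contains the entry quoted above, a small shell helper along these lines may be useful. This is a minimal sketch: the function name check_cassandra_log is hypothetical, and the default path is the log location named in this article.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a Cassandra log that carry the
# "failed to map segment from shared object" error quoted in this article.
# Usage: check_cassandra_log [path-to-log]
# Prints the number of matching lines; returns 0 if any were found,
# 1 if none were found, 2 if the file cannot be read.
check_cassandra_log() {
    log_file="${1:-/storage/core/loginsight/var/cassandra.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # -F matches a fixed string, -c prints the count of matching lines.
    # grep exits non-zero on zero matches, so "|| true" keeps the count only.
    matches=$(grep -c -F 'failed to map segment from shared object' "$log_file" || true)
    echo "$matches"
    [ "$matches" -gt 0 ]
}
```

Run check_cassandra_log with no argument on a node to inspect the default log, or pass a path to a copy of the log collected from a support bundle.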

Environment

VMware Aria Operations for Logs 8.x

Cause

The Cassandra database is in an inconsistent state, resulting in data retrieval failures.

Resolution

Take a snapshot of every node in the cluster, without memory, before making any changes.
For the procedure, see How to take a snapshot.
Log in to each Aria Operations for Logs node over SSH and run the following command on each node:
nodetool-no-pass status

Each node should have a status of UN. A status of DN indicates that the Cassandra service is not running as expected on that node (UN = Up/Normal, DN = Down/Normal).
Stop the loginsight daemon service on all nodes:
systemctl stop loginsight
Start Cassandra on each node:
/usr/lib/loginsight/application/sbin/li-cassandra.sh --startnow --force
Check the status of Cassandra on all nodes again using the nodetool-no-pass status command.
If any node is not in UN status, stop and start Cassandra on that node:

/usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force
/usr/lib/loginsight/application/sbin/li-cassandra.sh --startnow --force
If all nodes are up (UN status), run flush and repair on all nodes:

nodetool-no-pass flush
nodetool-no-pass repair
Once the repair is complete, stop Cassandra and start the loginsight daemon on each node. Make sure a node is up and running before proceeding to the next one:
/usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force
systemctl start loginsight
After starting all of the nodes, make sure that Cassandra is up and running on each node:

nodetool-no-pass status
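
The status checks in the procedure above can be made less error-prone with a small filter that lists only the nodes that are not UN. This is a sketch under an assumption: it expects nodetool-no-pass status to print the standard Cassandra nodetool status layout, where each node row begins with a two-letter code followed by the node address. The function name nodes_not_un is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `nodetool status`-style output on stdin and print
# "STATE ADDRESS" for every node whose state is not UN.
# Returns 0 only when at least one node row was seen and all rows were UN;
# returns 1 otherwise (including when no node rows were found at all).
nodes_not_un() {
    awk '
        BEGIN { seen = 0; bad = 0 }
        # Node rows start with U or D (up/down) followed by N, L, J or M
        # (normal/leaving/joining/moving), then whitespace.
        /^[UD][NLJM][ \t]/ {
            seen++
            if ($1 != "UN") { print $1, $2; bad++ }
        }
        END { exit (seen > 0 && bad == 0) ? 0 : 1 }
    '
}
```

On a node, this would be used as: nodetool-no-pass status | nodes_not_un. Empty output with a zero exit status suggests that every node reported by that command is UN; any printed line identifies a node to stop and start again before running flush and repair.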

Note:

Remove all snapshots as soon as they are no longer needed.

Snapshots kept for longer than 72 hours will cause performance issues. See Best practices for using VMware snapshots in the vSphere environment for more details.

Additional Information

If you still encounter the error, perform a sequential reboot of all the nodes:

Power down all the nodes in the Aria Operations for Logs cluster.
Power on the primary node and wait until the blue Aria Operations for Logs splash screen is seen in the vSphere console.
Power on all remaining worker nodes.
Verify that the Aria Operations for Logs UI is available (this can take up to 20 minutes after all nodes are powered on).

"FAILED: Unable to get user data. Possible Cassandra is down" - Aria Operations for Logs

Article ID: 389806