One of the worker nodes in Aria Operations for Logs fails to report its status in Cassandra due to token collision issue

Article ID: 369764

Updated On: 06-20-2024

Products

VMware Aria Suite

Issue/Introduction

  • When you run the nodetool-no-pass status command, one of the worker nodes is missing (hidden) from the output.

  • However, when you run the nodetool-no-pass describecluster command, all nodes in the cluster are listed.

  • Token collision causes one of the nodes to be 'hidden' by the other one. The Cassandra service runs normally on the 'hidden' node, but the node isn't visible to the rest of the nodes. In practice, the node doesn't participate in read/write operations and is in a stale state. Additionally, of the two worker nodes holding the same token ranges, the one that started earlier becomes 'hidden': if the nodes (or the Cassandra service on them) are restarted, the one that comes up later becomes the active one and the other one becomes 'hidden'.
  • Currently, Aria Operations for Logs encounters token collision errors when multiple nodes are added to the cluster simultaneously, without waiting for each node to reach "Startup complete". The following stack trace appears in /var/log/loginsight/cassandra.log:
    ERROR [main] 2024-06-12T10:48:17,780 CassandraDaemon.java:898 - Exception encountered during startup
    java.lang.RuntimeException: Bootstrap Token collision between /xx.xx.xx.42:7000 and /xx.xx.xx.43:7000 (token 8344507750274794369
            at org.apache.cassandra.locator.TokenMetadata.addBootstrapTokens(TokenMetadata.java:378) ~[apache-cassandra-4.1.0.jar:4.1.0]
            at org.apache.cassandra.locator.TokenMetadata.addBootstrapTokens(TokenMetadata.java:360) ~[apache-cassandra-4.1.0.jar:4.1.0]
            at org.apache.cassandra.service.StorageService.handleStateBootstrap(StorageService.java:2798) ~[apache-cassandra-4.1.0.jar:4.1.0]
            at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2496) ~[apache-cassandra-4.1.0.jar:4.1.0]
            at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1659) ~[apache-cassandra-4.1.0.jar:4.1.0]
            at org.apache.cassandra.gms.Gossiper.addLocalApplicationStateInternal(Gossiper.java:2057) ~[apache-cassandra-4.1.0.jar:4.1.0]
  • In the same cassandra.log file, you will also see log entries similar to:

    INFO  [main] 2024-06-12T06:04:50,598 StorageService.java:3015 - Nodes /XX.XX.XX.42:7000 and /XX.XX.XX.43:7000 have the same token -1920104746265459352. /XX.XX.XX.42:7000 is the new owner
    INFO  [main] 2024-06-12T06:04:50,599 StorageService.java:3015 - Nodes /XX.XX.XX.42:7000 and /XX.XX.XX.43:7000 have the same token -1982750550513922253. /XX.XX.XX.42:7000 is the new owner
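
To check whether a node has logged either kind of entry shown above, the log can be searched for both message patterns. The following is a minimal sketch, not an official tool: the function names find_token_collisions and collision_pairs are hypothetical, and the match strings are taken from the log excerpts in this article.

```shell
# find_token_collisions LOGFILE
# Prints every line of LOGFILE that reports a bootstrap token collision
# or two nodes having the same token. Returns non-zero if nothing matches.
find_token_collisions() {
    grep -E 'Bootstrap Token collision between|have the same token' "$1"
}

# collision_pairs LOGFILE
# Reduces the matching lines to the distinct pairs of node addresses involved.
collision_pairs() {
    find_token_collisions "$1" |
        sed -E 's#.*(between|Nodes) (/[^ ]+) and (/[^ ]+).*#\2 \3#' |
        sort -u
}
```

For example, collision_pairs /var/log/loginsight/cassandra.log would print one line per distinct pair of colliding nodes found in that log.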

Environment

vRealize Log Insight 8.x

VMware Aria Operations for Logs 8.x

Cause

Token collision errors in Cassandra occur when two or more nodes in a Cassandra cluster are assigned the same token range.

Resolution

  • Cassandra nodes are permanently assigned token ranges when they bootstrap, and these ranges do not change afterwards. Normally, each node owns a unique set of ranges, and each range it owns is also replicated to a number of replicas according to the REPLICATION_FACTOR of the keyspace (replication factor is a per-keyspace setting). For each node to bootstrap correctly, Cassandra advises joining nodes sequentially: wait for the currently joining node to bootstrap and finish joining the cluster, and only then start joining another node.
  • Token collision can occur in Aria Operations for Logs clusters that have more than 2 nodes. The primary node cannot have this issue by definition: it is the first node in the cluster and is deployed independently of the rest of the nodes. For worker nodes to have a token collision, there must be at least two of them.
  • When multiple nodes claim the same token or token range, data ownership and distribution can become skewed, potentially resulting in some nodes holding more data than others.
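
The advice to join nodes sequentially can be supported by waiting for a joining node to finish startup before the next node is added. The following is a minimal sketch, assuming that a node writes a line containing "Startup complete" to its Cassandra log once it has finished joining. The function name wait_for_startup, the default timeout, and the polling interval are hypothetical. Because the check matches any earlier occurrence of the string, it is only meaningful for a freshly deployed node whose log starts empty.

```shell
# wait_for_startup LOGFILE [TIMEOUT_SECONDS] [POLL_SECONDS]
# Polls LOGFILE until it contains "Startup complete".
# Returns 0 when the string is found, 1 when the timeout is reached first.
wait_for_startup() {
    logfile=$1
    timeout=${2:-600}   # assumed default: give up after 10 minutes
    poll=${3:-5}        # assumed default: check every 5 seconds
    waited=0
    while :; do
        if grep -q 'Startup complete' "$logfile" 2>/dev/null; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
}
```

Run on the joining node, for example as wait_for_startup /var/log/loginsight/cassandra.log 1800 10, and add the next node only after the function returns successfully.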

Follow the steps below to resolve the token collision issue:

Note: Take an offline snapshot of all nodes in the cluster before performing the steps below.

1. Identify the 'hidden' node:

1.1. Run nodetool on the primary node to view the list of nodes: nodetool-no-pass status

1.2. Open the vRLI UI and view the list of nodes under Management → Cluster. Alternatively, check the list of hosts configured in the <distributed></distributed> tag in the latest loginsight-config.xml file in /storage/core/loginsight/config.

1.3. The node that is missing from the nodetool output is the 'hidden' node of interest.
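
The comparison in step 1 can also be made against the output of nodetool-no-pass describecluster, which lists all nodes as described in the Issue section. The following is a minimal sketch, assuming both outputs have been saved to files and that nodes are identified by IPv4 addresses. The function names extract_ips and hidden_nodes and the file names in the comments are hypothetical.

```shell
# Assumed preparation on the primary node:
#   nodetool-no-pass describecluster > /tmp/describecluster.txt
#   nodetool-no-pass status          > /tmp/status.txt

# extract_ips FILE
# Prints the distinct IPv4 addresses that appear anywhere in FILE.
extract_ips() {
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' "$1" | sort -u
}

# hidden_nodes DESCRIBECLUSTER_FILE STATUS_FILE
# Prints addresses present in the describecluster output
# but absent from the status output.
hidden_nodes() {
    visible=$(extract_ips "$2")
    extract_ips "$1" | while read -r ip; do
        printf '%s\n' "$visible" | grep -qxF "$ip" || printf '%s\n' "$ip"
    done
}
```

Calling hidden_nodes /tmp/describecluster.txt /tmp/status.txt prints one address per line; an empty result means that both commands report the same set of nodes.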

2. Stop vRLI and Cassandra on the 'hidden' node: service loginsight stop

3. If Cassandra is still running, execute: /usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force

4. Navigate to the Cassandra data directory on the 'hidden' node: cd /storage/core/loginsight/cidata/cassandra/

5. Remove commitlog(s) on the 'hidden' node: rm -rf commitlog*

6. Remove data on the 'hidden' node: rm -rf data/

7. Launch vRLI and Cassandra on the 'hidden' node: service loginsight start
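
Steps 2 through 7 can be rehearsed before anything is changed by printing the commands first. The following is a minimal sketch, not a supported script. The names run, reset_hidden_node, DRY_RUN, and CASSANDRA_DIR are hypothetical, and using pgrep to decide whether Cassandra is still running is an assumption, since the article does not say how to check. With DRY_RUN left at its default of 1, commands are only printed.

```shell
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
CASSANDRA_DIR=${CASSANDRA_DIR:-/storage/core/loginsight/cidata/cassandra}

# run COMMAND [ARGS...]
# Prints the command in dry-run mode, otherwise executes it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

# reset_hidden_node
# Walks through steps 2-7 on the 'hidden' node, stopping at the first failure.
reset_hidden_node() {
    # Step 2: stop vRLI and Cassandra
    run service loginsight stop || return 1
    # Step 3: force-stop Cassandra if it is still running
    # (assumption: pgrep -f cassandra detects a running Cassandra process;
    # in dry-run mode the command is always printed)
    if [ "$DRY_RUN" = "1" ] || pgrep -f cassandra >/dev/null 2>&1; then
        run /usr/lib/loginsight/application/sbin/li-cassandra.sh --stopnow --force || return 1
    fi
    # Steps 4-5: remove commit logs from the Cassandra data directory
    run rm -rf "$CASSANDRA_DIR"/commitlog* || return 1
    # Step 6: remove data
    run rm -rf "$CASSANDRA_DIR/data" || return 1
    # Step 7: start vRLI and Cassandra
    run service loginsight start
}
```

Calling reset_hidden_node with the defaults lists the commands in order. Setting DRY_RUN=0 executes them, which is destructive and should be done only on the 'hidden' node after the offline snapshots have been taken.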