VCF/Aria Operations for Logs VIP remains 'Unavailable' and manual database updates fail due to Cassandra Tombstones

Article ID: 436198

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

The Integrated Load Balancer (VIP) for a VCF/Aria Operations for Logs multi-node cluster is stuck in an "Unavailable" state.
When following the steps to manually elect a leader in KB VIP for Aria Operations for Logs is in Unavailable status (398042), the UPDATE or DELETE queries complete successfully in cqlsh, but the election.leaders and loadbalancer.leaders tables remain empty.

The /storage/core/loginsight/var/runtime.log file shows the nodes rapidly flapping, losing and regaining leadership multiple times per second:

["LeaderElectionStateUpdaterScheduler-thread-2"/<IP_ADDRESS> INFO] [com.vmware.loginsight.election.cassandra.PullBasedLeaderElection] [Discovered that leadership was lost, attempt to become leader, Group: defaultLeadersGroup, ParticipantId: ########-####-####-####-############]
["LeaderElectionStateUpdaterScheduler-thread-2"/<IP_ADDRESS> INFO] [com.vmware.loginsight.election.cassandra.PullBasedLeaderElection] [Became leader again, Group: defaultLeadersGroup, ParticipantId: ########-####-####-####-############]
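As a rough way to confirm this symptom, the sketch below counts leadership-lost and leadership-regained messages per participant in runtime.log text. It assumes the message wording shown in the excerpt above; the regular expression, the function names and the min_losses threshold are hypothetical illustrations, not product values or tooling.

```python
import re
from collections import Counter

# Hypothetical pattern built from the runtime.log messages quoted in this article.
# It captures the event text, the election group and the participant ID.
LEADERSHIP_EVENT = re.compile(
    r"\[(Discovered that leadership was lost|Became leader again)"
    r"[^\]]*?Group: ([^,\]]+), ParticipantId: ([^\]]+)\]"
)

LOST = "Discovered that leadership was lost"


def count_leadership_events(log_text):
    """Count (participant_id, event) pairs; works even if entries share one line."""
    counts = Counter()
    for match in LEADERSHIP_EVENT.finditer(log_text):
        event, _group, participant = match.groups()
        counts[(participant, event)] += 1
    return counts


def flapping_participants(counts, min_losses=10):
    """Return participants that lost leadership at least min_losses times.

    The default of 10 is an arbitrary illustration, not a documented limit.
    """
    losses = Counter()
    for (participant, event), n in counts.items():
        if event == LOST:
            losses[participant] += n
    return sorted(p for p, n in losses.items() if n >= min_losses)
```

The function takes the log as a single string, so it can be fed the contents of /storage/core/loginsight/var/runtime.log read as text; a participant that appears with many "lost" events over a short log window matches the flapping pattern described here.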

Additionally, the /storage/core/loginsight/var/cassandra.log file shows write failures and reads that scan very large numbers of tombstone cells:

WARN  [ReadStage-1] <DATE_TIME> ReadCommand.java:605 - Read 0 live rows and 6182 tombstone cells for query SELECT * FROM machine_learning.spock_global_queries_v2 WHERE bucket = 0 LIMIT 5000 ALLOW FILTERING; token -#################### (see tombstone_warn_threshold)
...
java.util.concurrent.ExecutionException: com.datastax.oss.driver.api.core.servererrors.WriteFailureException: Cassandra failure during write query at consistency TWO (2 responses were required but only 0 replica responded, 1 failed)
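The tombstone warnings can be summarized per table with a short script such as the following. It is a sketch that assumes each warning follows the "Read N live rows and M tombstone cells for query SELECT ... FROM keyspace.table" wording shown above; the TableTombstones class and the function name are hypothetical.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical pattern for the ReadCommand tombstone warning quoted in this article.
TOMBSTONE_WARNING = re.compile(
    r"Read (\d+) live rows and (\d+) tombstone cells for query SELECT .*? FROM ([\w.]+)"
)


@dataclass
class TableTombstones:
    warnings: int = 0                 # number of tombstone warnings for the table
    max_tombstones: int = 0           # largest tombstone count seen in one read
    reads_with_no_live_rows: int = 0  # warnings where the read returned no data


def summarize_tombstone_warnings(lines):
    """Aggregate tombstone warnings from cassandra.log lines by keyspace.table."""
    summary = defaultdict(TableTombstones)
    for line in lines:
        match = TOMBSTONE_WARNING.search(line)
        if match is None:
            continue  # not a tombstone warning
        live_rows = int(match.group(1))
        tombstones = int(match.group(2))
        stats = summary[match.group(3)]
        stats.warnings += 1
        stats.max_tombstones = max(stats.max_tombstones, tombstones)
        if live_rows == 0:
            stats.reads_with_no_live_rows += 1
    return dict(summary)
```

A table whose reads return 0 live rows while scanning thousands of tombstone cells, as in the warning above, is the pattern this article describes.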

Environment

  • Aria Operations for Logs 8.18.x
  • VCF Operations for Logs 9.x.x

Cause

When nodes in a VCF/Aria Operations for Logs cluster lose communication with each other or experience a split-brain scenario, they continuously attempt to claim leadership. Cassandra does not erase deleted data immediately; instead, every change of leadership writes a "Tombstone" to the database that marks the previous entry as deleted.
This rapid flapping generates thousands of tombstones in the election.leaders and loadbalancer.leaders tables. These tombstones create a "shadow" over the tables: when UPDATE or DELETE commands are run manually, existing tombstones that carry newer timestamps take precedence, so the new data never becomes visible even though cqlsh reports success. The steps in KB VIP for Aria Operations for Logs is in Unavailable status (398042) therefore fail, because the application's background threads keep competing with the manual database changes while the tables are flooded with tombstones.
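The shadowing effect can be illustrated with a deliberately simplified model of last-write-wins reconciliation for a single cell. This is not Cassandra's storage engine: the Version and ModelTable names are hypothetical, timestamps are plain integers, and the model lets a deletion win a timestamp tie. It shows why a manual write whose timestamp is older than the newest tombstone stays invisible, and why truncating the table, which discards every stored version including tombstones, lets the same write become visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    timestamp: int          # write timestamp (illustrative integer)
    value: Optional[str]    # None represents a tombstone


class ModelTable:
    """Toy single-cell table that keeps every version until truncated."""

    def __init__(self) -> None:
        self.versions: List[Version] = []

    def write(self, timestamp: int, value: str) -> None:
        self.versions.append(Version(timestamp, value))

    def delete(self, timestamp: int) -> None:
        # A delete does not erase anything; it appends a tombstone.
        self.versions.append(Version(timestamp, None))

    def truncate(self) -> None:
        # Truncation drops all versions, tombstones included.
        self.versions.clear()

    def read(self) -> Optional[str]:
        if not self.versions:
            return None
        # Highest timestamp wins; on a tie the tombstone (value None) ranks higher.
        newest = max(self.versions, key=lambda v: (v.timestamp, v.value is None))
        return newest.value


if __name__ == "__main__":
    table = ModelTable()
    for ts in range(100, 200, 2):   # simulated flapping: claim, then delete
        table.write(ts, "node-a")
        table.delete(ts + 1)
    table.write(150, "node-b")      # manual update with an older timestamp
    print(table.read())             # None: shadowed by the tombstone at 199
    table.truncate()
    table.write(150, "node-b")
    print(table.read())             # node-b
```

Keeping every version and resolving at read time is also what makes the flapping expensive: each read has to walk past all accumulated tombstones, which is what the tombstone_warn_threshold warnings report.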

Resolution

To bypass the tombstone shadow and force a new leader, the affected tables must be truncated (which removes their tombstones immediately) and then repopulated with an INSERT command while the database still retains a quorum. Because these are direct database modifications, open a Support Request with Broadcom Technical Support and cite this Article ID (436198) in the problem description. For more information, see Creating and managing Broadcom support cases.

Additional Information