The DBA is concerned about the large partition in oauth_refresh_token_view_client_key, which caused CPU spikes and long streaming times among the Cassandra nodes during the Cassandra repair process.
This issue could cause a Prod outage due to high CPU spikes on the Cassandra nodes as they attempt to repair the table used by OTK.
What is known:
1. oauth_refresh_token_view_client_key is used as part of the Authorization Code flow, where clients are issued a Refresh Token so they can obtain new Access Tokens without repeating the full authorization code flow.
2. Some clients can have up to 10 million Resource Owners, with a Refresh Token TTL of 1.5 years.
3. Because the partition key of oauth_refresh_token_view_client_key is the client_key alone, a single client can accumulate up to 10 million long-lived records in one partition. During repair and compaction this produces a very large SSTable (estimated at 147GB), which is then streamed to the other nodes to satisfy the Replication Factor. Streaming this huge amount of data caused the CPU spikes on the nodes. A simplified sketch of a schema with this shape is shown after this list.
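To make the problem concrete, here is a minimal CQL sketch of a table with this shape. This is an illustration only: the column names (token, resource_owner, expiration) and their types are assumptions, not the actual OTK schema. The point is that client_key alone is the partition key, so every refresh token for a client lands in the same partition.

    -- Hypothetical simplified schema; actual OTK column names may differ.
    -- client_key alone is the partition key, so all of a client's
    -- refresh tokens (up to ~10 million rows) share one partition.
    CREATE TABLE oauth_refresh_token_view_client_key (
        client_key     text,
        token          text,
        resource_owner text,
        expiration     bigint,
        PRIMARY KEY (client_key, token)
    );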
Work Done:
Several suggestions were given by Cassandra support to remediate the situation. One suggestion was to update the oauth_refresh_token_view_client_key table schema so that the partition key is more selective (more unique).
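One common way to make the partition key more selective is to add a bucket column to the partition key, so that a single client's tokens are spread across many smaller partitions. The sketch below is a hypothetical illustration, not the schema Cassandra support proposed: the bucket column, the bucket count of 32, and the other column names are all assumptions.

    -- Hypothetical remediation sketch; names and bucket count are assumptions.
    -- The composite partition key (client_key, bucket) splits one huge
    -- partition into many bounded ones.
    CREATE TABLE oauth_refresh_token_view_client_key_v2 (
        client_key     text,
        bucket         int,   -- e.g. computed by the app as hash(token) % 32
        token          text,
        resource_owner text,
        expiration     bigint,
        PRIMARY KEY ((client_key, bucket), token)
    );

The trade-off is on reads: fetching all tokens for a client_key then requires querying every bucket (32 queries in this sketch), which is usually acceptable in exchange for bounded partition sizes and cheaper repair streaming.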