Tanzu Hub upgrade fails due to ensemble-observability-store authentication failures to clickhouse database

Products

VMware Tanzu Platform Core Tanzu Hub SaaS

Issue/Introduction

Upgrading Tanzu Hub from 10.3.x to 10.4.1 fails during Apply Changes.

The Apply Changes log reports a timeout waiting for the ensemble-observability-store deployment:

   Timed out waiting after 25m0s for resources: [deployment/ensemble-observability-store (apps/v1) namespace: tanzusm]

The ensemble-observability-store logs show an authentication error when connecting to the ClickHouse database:

ensemble-observability-store.log

Failed connecting to clickhouse: [TssClickHouseConnectionException] with error: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed:  password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED)  com.vmware.aria.tss.server.exception.TssClickHouseConnectionException: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED)

Cause

This issue is caused by a loss of quorum in ClickHouse. As a result, users and roles are not propagated to all ClickHouse shards, which leads to authentication failures.

Resolution

As a workaround, manually correct the inconsistency in the ClickHouse database:

1.) BOSH SSH to the Tanzu Hub registry VM:

bosh -d <Hub deployment> ssh registry/0

2.) Open a bash session in the ClickHouse metrics pod:

kubectl -n tanzusm exec -it <click-house-metrics Pod> -- bash

3.) Inside the pod, get the database password from the environment variables, then connect to the ClickHouse database:

env|grep PASS
clickhouse-client

4.) List the roles and count the entries in system.roles and system.users on each ClickHouse pod:

select hostName(), name from clusterAllReplicas('default', system.roles) group by hostName(), name order by hostName(), name;
select hostName(), count() from clusterAllReplicas('default', system.roles) group by hostName();
select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName();

If the counts differ between ClickHouse pods, you are affected by the issue described in this KB.
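The comparison above can also be scripted. The following is a minimal sketch, not part of the official procedure: it assumes the count query output is available as tab-separated "host, count" rows (the default output format when clickhouse-client is run with --query), and the function name check_counts is hypothetical.

```shell
# check_counts: read "host<TAB>count" rows on stdin, echo them back,
# and report whether every host returned the same count.
# Exit status: 0 = consistent, 1 = inconsistent, 2 = no rows read.
check_counts() {
  awk -F'\t' '
    { print; seen[$2] = 1 }          # remember each distinct count value
    END {
      n = 0
      for (c in seen) n++            # number of distinct count values
      if (n == 0) { print "NO ROWS"; exit 2 }
      if (n > 1)  { print "INCONSISTENT"; exit 1 }
      print "CONSISTENT"
    }'
}

# Possible usage inside the pod (assumes the password has been placed in a
# variable named CH_PASSWORD, which is a hypothetical name):
#   clickhouse-client --password "$CH_PASSWORD" \
#     --query "select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName()" \
#     | check_counts
```

A last line of INCONSISTENT means at least one pod reports a different number of users or roles than the others.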


5.) For any inconsistent users, drop and recreate the user with the password from the environment variable (obtained in step 3):

DROP USER IF EXISTS observability_user on cluster 'default';
CREATE USER IF NOT EXISTS observability_user  on cluster 'default' IDENTIFIED BY '<db_password>';
 
DROP USER IF EXISTS observability_remote_query_user on cluster 'default';
CREATE USER IF NOT EXISTS observability_remote_query_user on cluster 'default' IDENTIFIED  BY '<db_password>';

Verify that the user count is now consistent:

select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName() ;

6.) Run the following CREATE commands so that the roles are consistent across pods:

CREATE ROLE IF NOT EXISTS log_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS log_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS log_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS event_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS event_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS event_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS trace_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS trace_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS remote_query_reader on cluster 'default';
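The twelve statements above all follow one pattern, so they can be generated rather than typed. This is a small sketch under that assumption; the role names are copied from the list above, and the variable and function names (ROLES, print_role_statements) are hypothetical.

```shell
# Role names taken from the CREATE ROLE statements in this step.
ROLES="log_interactive_reader log_background_reader log_writer
metric_interactive_reader metric_background_reader metric_writer
event_interactive_reader event_background_reader event_writer
trace_interactive_reader trace_writer
remote_query_reader"

# Print one idempotent CREATE ROLE statement per role name.
print_role_statements() {
  for role in $ROLES; do
    printf "CREATE ROLE IF NOT EXISTS %s ON CLUSTER 'default';\n" "$role"
  done
}

print_role_statements
```

The printed statements can be pasted into the clickhouse-client session. Because each uses IF NOT EXISTS, running them on a cluster where some roles already exist is harmless.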

Verify that the role count is now consistent:

select hostName(), count() from clusterAllReplicas('default', system.roles) group by hostName() ;

After these steps, the ensemble-observability-store pod should reconcile automatically. Allow 15 minutes or more for the system to reconcile after applying the fix.
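Instead of polling by hand, the reconciliation can be watched with kubectl. This is a sketch only: it assumes it is run from the registry VM (not from inside the ClickHouse pod), the function name wait_for_store is hypothetical, and the 15m timeout is simply the figure quoted above and may need to be raised.

```shell
# Block until the deployment named in the Apply Changes timeout message
# finishes rolling out, or give up after the timeout.
wait_for_store() {
  kubectl -n tanzusm rollout status deployment/ensemble-observability-store --timeout=15m
}

# Usage:
#   wait_for_store && echo "ensemble-observability-store is ready"
```

A non-zero exit status means the rollout did not complete within the timeout; in that case, recheck the ensemble-observability-store logs for the authentication error.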

This issue will be addressed in a future release of Tanzu Hub.