Tanzu Upgrade from 10.3.x to 10.4.1 fails during apply changes.
The Apply Changes log indicates a timeout while waiting for the ensemble-observability-store deployment:
Timed out waiting after 25m0s for resources: [deployment/ensemble-observability-store (apps/v1) namespace: tanzusm]
The ensemble-observability-store logs show an authentication error when connecting to the ClickHouse database:
ensemble-observability-store.log
Failed connecting to clickhouse: [TssClickHouseConnectionException] with error: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) com.vmware.aria.tss.server.exception.TssClickHouseConnectionException: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED)
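One quick way to confirm this symptom is to search the deployment logs for ClickHouse error code 516. The sketch below is illustrative only: the function name check_auth_failure is hypothetical, and the commented usage line assumes kubectl access to the tanzusm namespace.

```shell
#!/usr/bin/env bash
# check_auth_failure (hypothetical helper): reads log text on stdin and
# reports whether it contains the ClickHouse authentication error
# (Code: 516 ... AUTHENTICATION_FAILED) quoted in this KB.
check_auth_failure() {
  if grep -q 'Code: 516.*AUTHENTICATION_FAILED'; then
    echo "auth-failure-found"
  else
    echo "auth-failure-not-found"
  fi
}

# Usage (assumes kubectl can reach the tanzusm namespace):
#   kubectl -n tanzusm logs deployment/ensemble-observability-store | check_auth_failure
```

If the helper prints auth-failure-found, the logs contain the same error shown above.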
This issue is caused by a loss of quorum in ClickHouse. As a result, users and roles are not propagated across all ClickHouse shards, which leads to authentication failures.
The workaround is to manually correct this inconsistency by updating the ClickHouse database:
1.) Use BOSH to SSH to the Tanzu Hub registry VM:
bosh -d <Hub deployment> ssh registry/0
2.) Open a bash session in the ClickHouse metrics pod:
kubectl -n tanzusm exec -it <click-house-metrics Pod> -- bash
3.) Retrieve the database password from the environment variables inside the pod, then connect to the ClickHouse database:
env|grep PASS
clickhouse-client
4.) List the roles and count the entries in system.roles and system.users on each ClickHouse pod:
select hostName(), name from clusterAllReplicas('default', system.roles) group by hostName(), name order by hostName(), name;
select hostName(), count() from clusterAllReplicas('default', system.roles) group by hostName();
select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName();
If the counts differ across the ClickHouse pods, you are affected by the issue described in this KB.
5.) For each inconsistent user, drop the user and recreate it using the password from the environment variable (obtained in step 3):
DROP USER IF EXISTS observability_user on cluster 'default';
CREATE USER IF NOT EXISTS observability_user on cluster 'default' IDENTIFIED BY '<db_password>';
DROP USER IF EXISTS observability_remote_query_user on cluster 'default';
CREATE USER IF NOT EXISTS observability_remote_query_user on cluster 'default' IDENTIFIED BY '<db_password>';
Verify that the user count is now consistent:
select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName();
6.) Run the following CREATE ROLE commands so that the roles are consistent across pods:
CREATE ROLE IF NOT EXISTS log_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS log_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS log_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS metric_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS event_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS event_background_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS event_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS trace_interactive_reader on cluster 'default';
CREATE ROLE IF NOT EXISTS trace_writer on cluster 'default';
CREATE ROLE IF NOT EXISTS remote_query_reader on cluster 'default';
Verify that the role count is now consistent:
select hostName(), count() from clusterAllReplicas('default', system.roles) group by hostName();
After you complete these steps, the ensemble-observability-store pod should reconcile automatically. Allow 15 minutes or more for the system to reconcile after applying the fix.
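The consistency checks in steps 4 through 6 can also be scripted instead of compared by eye. The following is a minimal sketch, not part of the official workaround. The helper names check_counts and missing_names are hypothetical, and the sketch assumes the queries are run non-interactively with clickhouse-client --query, which emits tab-separated rows by default.

```shell
#!/usr/bin/env bash
# check_counts (hypothetical helper): reads "<host><TAB><count>" rows on
# stdin and prints "consistent" only when every host reports the same count.
# Empty input is reported as "inconsistent".
check_counts() {
  awk -F'\t' '
    NF >= 2 { if (!($2 in seen)) { seen[$2] = 1; distinct++ } }
    END {
      if (distinct == 1) { print "consistent"; exit 0 }
      print "inconsistent"; exit 1
    }'
}

# missing_names (hypothetical helper): reads "<host><TAB><name>" rows on
# stdin and prints each user or role name that is absent from at least one
# of the hosts appearing in the input. A host that returns no rows at all
# cannot be detected this way.
missing_names() {
  awk -F'\t' '
    NF >= 2 {
      if (!($1 in hosts)) { hosts[$1] = 1; nhosts++ }
      key = $1 SUBSEP $2
      if (!(key in pair)) { pair[key] = 1; per_name[$2]++ }
    }
    END { for (name in per_name) if (per_name[name] < nhosts) print name }
  ' | sort
}

# Usage inside the ClickHouse pod (replace <db_password> as in step 5):
#   clickhouse-client --password '<db_password>' --query \
#     "select hostName(), count() from clusterAllReplicas('default', system.users) group by hostName()" \
#     | check_counts
#   clickhouse-client --password '<db_password>' --query \
#     "select hostName(), name from clusterAllReplicas('default', system.roles) group by hostName(), name" \
#     | missing_names
```

The second helper narrows the result from "the counts differ" to the specific names that need to be recreated. The same name-level query can be pointed at system.users by swapping the table name.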
This issue will be addressed in a future release of Tanzu Hub.