Tanzu Hub Install/Upgrade - Observability Store Pods CrashLoopBackOff after HA-enabled environment upgrade

Article ID: 442516

Products

VMware Tanzu Platform - Hub

Issue/Introduction

Observability store pods enter a CrashLoopBackOff state after an upgrade in an environment where High Availability (HA) is enabled.
The issue affects upgrades from any earlier version to any target version up to and including 10.4.1.

How to Check for the Issue

Step 1: Check the Hub install logs in Ops Manager (Opsman)
The logs contain errors similar to the following:

[x] Installation failed with error: timed out waiting for PackageInstall to reconcile. Last failure: kapp: Error: waiting on reconcile packageinstall/ensemble-helm (packaging.carvel.dev/v1alpha1) namespace: tanzusm:
Finished waiting unsuccessfully:
Reconcile failed: message: kapp:
Error:
Timed out waiting after 25m0s for resources: [deployment/ensemble-observability-store (apps/v1) namespace: tanzusm]

 

Step 2: Check the observability store pod logs

kubectl logs -n tanzusm ensemble-observability-store-<PodID> 

1. Store pods that are still running log the following error:

23 ] : Failed connecting to clickhouse: [TssClickHouseConnectionException] with error: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED) com.vmware.aria.tss.server.exception.TssClickHouseConnectionException: Auth failed connecting to clickhouse: Code: 516. DB::Exception: observability_user: Authentication failed: password is incorrect, or there is no user with such name. (AUTHENTICATION_FAILED)

2. Store pods that are in CrashLoopBackOff log the following error:

2026-05-22T16:41:05,297 INFO  [Thread-2] c.v.a.t.s.r.ClickHouseSchemaManager <> : Executing DDL on individual shards and replicas : 3, 2 
2026-05-22T16:41:05,297 INFO  [Thread-2] c.v.a.t.s.r.ClickHouseSchemaManager <> : Using serviceUrl chi-clickhouse-metrics-default-0-0-0.service-clickhouse-metrics-default-0-0 for shard 0 replica 0 and replication enabled is : true 
2026-05-22T16:41:05,336 INFO  [Thread-2] c.v.a.t.s.u.SchemaManagerUtils <> : remoteUserWithStandardPasswordEnabled is set to false 
2026-05-22T16:41:05,860 INFO  [Thread-2] l.l.c.JavaLogger <> : Reading from default.DATABASECHANGELOG 
2026-05-22T16:41:06,741 INFO  [Thread-2] l.l.c.JavaLogger <> : Successfully acquired change log lock 
2026-05-22T16:41:06,757 INFO  [Thread-2] l.l.c.JavaLogger <> : Using deploymentId: 9468052172 
2026-05-22T16:41:06,764 INFO  [Thread-2] l.l.c.JavaLogger <> : Reading from default.DATABASECHANGELOG 
Running Changeset: liquibase/v2/clickhouse-rbac-roles.sql::3::shrishac
2026-05-22T16:41:07,100 ERROR [Thread-2] l.l.c.JavaLogger <> : ChangeSet liquibase/v2/clickhouse-rbac-roles.sql::3::shrishac encountered an exception. liquibase.exception.DatabaseException: Code: 511. DB::Exception: There was an error on [chi-clickhouse-metrics-default-0-1-0.service-clickhouse-metrics-default-0-1.tanzusm.svc.cluster.local:9000]: Code: 511. DB::Exception: There is no role `event_interactive_reader` in `user directories`. (UNKNOWN_ROLE) (version 26.3.9.8 (official build)). (UNKNOWN_ROLE) (version 26.3.9.8 (official build))  [Failed SQL: (0) GRANT ON CLUSTER 'default' SELECT ON cdb_hc.events_custom TO event_interactive_reader, event_background_reader]

 

3. The widgets in Hub will show an UNAUTHENTICATED error.
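
As a rough aid for Step 2, the two log signatures above can be matched automatically. The sketch below is an illustration only: the function name classify_store_log is hypothetical, and it assumes the literal strings AUTHENTICATION_FAILED and UNKNOWN_ROLE shown in the sample logs are enough to tell the two cases apart.

```shell
# Classify observability store log text read from stdin.
# Prints one line per known error signature found, or "no_known_signature".
# The signature strings come from the sample logs in this article.
classify_store_log() (
  log=$(cat)
  found=0
  if printf '%s\n' "$log" | grep -q 'AUTHENTICATION_FAILED'; then
    echo "auth_failed"
    found=1
  fi
  if printf '%s\n' "$log" | grep -q 'UNKNOWN_ROLE'; then
    echo "unknown_role"
    found=1
  fi
  [ "$found" -eq 1 ] || echo "no_known_signature"
)
```

One way to use it is to pipe the pod logs into the function: kubectl logs -n tanzusm ensemble-observability-store-<PodID> | classify_store_log. For a pod that is crash-looping, adding --previous to kubectl logs may be needed to read the output of the last failed container.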

Environment

Tanzu Hub HA

Cause

In High Availability (HA) environments, an upgrade to any version up to and including 10.4.1 performs a phased rollout that restarts the replicas. Roles are not migrated automatically to the restarted pods. Because the restart is not recognized as a scale-out event, the observability store does not account for the change, and changelog processing fails with UNKNOWN_ROLE when a GRANT references a role that is missing on a replica. Administrators must create these roles manually on the newly created replicas so that changelog processing can complete.
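
To see which roles a restarted replica has lost, the role list on that replica can be compared with the roles this article creates in the Resolution section. The sketch below is a hedged illustration: missing_roles and EXPECTED_ROLES are hypothetical names, and the expected list is simply copied from the CREATE ROLE statements in Step 1 of the Resolution.

```shell
# Role names copied from the CREATE ROLE statements in the Resolution section.
EXPECTED_ROLES="log_interactive_reader log_background_reader log_writer
metric_interactive_reader metric_background_reader metric_writer
event_interactive_reader event_background_reader event_writer
trace_interactive_reader trace_writer remote_query_reader"

# Read the roles present on a replica (one name per line) from stdin
# and print every expected role that is absent.
missing_roles() (
  present=$(cat)
  for role in $EXPECTED_ROLES; do
    printf '%s\n' "$present" | grep -qx "$role" || echo "$role"
  done
)
```

The input is assumed to come from a query such as SELECT name FROM system.roles, run with clickhouse-client on the replica being checked. Empty output means none of the expected roles is missing on that replica.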

Resolution

Step 1: Create roles and users if missing

Run the following commands against shard 0, replica 0 (pod chi-clickhouse-metrics-default-0-0-0):

CH_PASS=$(kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 \
  -c clickhouse -- bash -c 'echo "$CLICKHOUSE_ADMIN_PASSWORD"' 2>/dev/null | tr -d '\r\n')

# Create Users
kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --multiquery --query="
    DROP USER IF EXISTS observability_user ON CLUSTER 'default';
    CREATE USER IF NOT EXISTS observability_user ON CLUSTER 'default' IDENTIFIED BY '$CH_PASS';
    DROP USER IF EXISTS observability_remote_query_user ON CLUSTER 'default';
    CREATE USER IF NOT EXISTS observability_remote_query_user ON CLUSTER 'default' IDENTIFIED BY '$CH_PASS';
  "
# Create Roles
kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --multiquery --query="
    CREATE ROLE IF NOT EXISTS log_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS log_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS log_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS trace_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS trace_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS remote_query_reader ON CLUSTER 'default';
  "

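After the roles are created, it may be worth confirming that every replica now reports them. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the pod naming pattern chi-clickhouse-metrics-default-<shard>-<replica>-0 is extrapolated from the two pod names that appear in this article. The shard and replica counts must match the environment; the sample log above reports 3 shards and 2 replicas.

```shell
# Print one ClickHouse pod name per shard/replica pair.
# Arguments: number of shards, number of replicas per shard.
# The naming pattern is an assumption based on the pod names in this article.
clickhouse_pod_names() (
  shards=$1
  replicas=$2
  s=0
  while [ "$s" -lt "$shards" ]; do
    r=0
    while [ "$r" -lt "$replicas" ]; do
      echo "chi-clickhouse-metrics-default-$s-$r-0"
      r=$((r + 1))
    done
    s=$((s + 1))
  done
)

# List the roles known to each replica.
# Requires CH_PASS to be set as in Step 1.
list_roles_per_replica() {
  for pod in $(clickhouse_pod_names "$1" "$2"); do
    echo "== $pod"
    kubectl -n tanzusm exec "$pod" -c clickhouse -- \
      clickhouse-client -u default --password="$CH_PASS" \
      --query="SELECT name FROM system.roles ORDER BY name"
  done
}
```

For the environment in the sample log, the call would be list_roles_per_replica 3 2. Each replica should list the twelve roles created above.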
 

Step 2: Restart the ensemble-observability-store deployment

kubectl -n tanzusm rollout restart deployment/ensemble-observability-store
kubectl -n tanzusm get po | grep store

Confirm that all store pods are in the Running state and ready. Then open the Hub UI and verify that the UNAUTHENTICATED errors are gone and that data is loading.
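
The readiness check can also be scripted rather than read by eye. The sketch below is a minimal illustration, assuming the default kubectl get po column layout (NAME, READY, STATUS, ...); store_pods_ready is a hypothetical helper name.

```shell
# Read `kubectl get po` output on stdin. Succeeds only if at least one
# store pod is listed and every store pod is Running with all of its
# containers ready (READY column of the form n/n).
store_pods_ready() {
  awk '
    /ensemble-observability-store/ {
      seen++
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}
```

A possible invocation is kubectl -n tanzusm get po | store_pods_ready && echo "store pods ready". Running kubectl -n tanzusm rollout status deployment/ensemble-observability-store beforehand waits for the restart to finish.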