Tanzu Hub Install/Upgrade - Few Clickhouse pods are in CrashLoopBackOff state during scaling from 3 to 7 shards

Article ID: 441087

Products

VMware Tanzu Platform - Hub

Issue/Introduction

Some ClickHouse pods enter the CrashLoopBackOff state while scaling from 3 to 7 shards.

How to Check for the Issue
Step 1: Verify ClickHouse pod status

kubectl get pods -n tanzusm -l app=clickhouse-op -o wide

Example output
chi-clickhouse-metrics-default-0-0-0                  1/1     Running            6 (12m ago)   25m
chi-clickhouse-metrics-default-1-0-0                  0/1     CrashLoopBackOff   8 (59s ago)   106m
chi-clickhouse-metrics-default-2-0-0                  0/1     CrashLoopBackOff   8 (61s ago)   24m
chi-clickhouse-metrics-default-3-0-0                  1/1     Running            0             24m
chi-clickhouse-metrics-default-4-0-0                  1/1     Running            0             24m
chi-clickhouse-metrics-default-5-0-0                  1/1     Running            0             22m
chi-clickhouse-metrics-default-6-0-0                  1/1     Running            0             22m
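
To pick out only the affected pods from output like the above, the status column can be filtered. This is a minimal sketch; the function name crashlooping_pods is hypothetical, and it assumes the default kubectl column order (NAME, READY, STATUS).

```shell
# Print the names of ClickHouse pods whose STATUS column is CrashLoopBackOff.
# Assumes the default `kubectl get pods` column order: NAME READY STATUS ...
crashlooping_pods() {
  kubectl get pods -n tanzusm -l app=clickhouse-op --no-headers 2>/dev/null \
    | awk '$3 == "CrashLoopBackOff" { print $1 }'
}
```

Calling crashlooping_pods prints one pod name per line, or nothing if no pod is crash-looping.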

Step 2: Check the logs of the ClickHouse pods that are in CrashLoopBackOff

kubectl logs -n tanzusm chi-clickhouse-metrics-default-1-0-0

The following error is observed:

2026.05.01 20:19:40.193488 [ 636 ] {} <Error> RaftInstance: session 67789 failed to process request message due to error: Trying to rollback invalid ZXID (70383435). It should be the last preprocessed.
2026.05.01 20:19:40.193664 [ 636 ] {} <Error> ForcedCriticalErrorsLogger: Code: 49. DB::Exception: Trying to rollback invalid ZXID (70383435). It should be the last preprocessed. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):

0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000166dc66a
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000e55bf0e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000e55b849
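
Because a crash-looping container restarts often, the error may only be present in the previous container's log. The sketch below searches both the current and the previous log for the message shown above. The function name has_zxid_rollback_error is hypothetical.

```shell
# Return 0 if the current or previous container log of the given pod
# contains the Keeper rollback error, non-zero otherwise.
has_zxid_rollback_error() {
  pod="$1"
  {
    kubectl logs -n tanzusm "$pod" 2>/dev/null
    kubectl logs -n tanzusm "$pod" --previous 2>/dev/null
  } | grep -q 'Trying to rollback invalid ZXID'
}
```

Example: has_zxid_rollback_error chi-clickhouse-metrics-default-1-0-0 && echo "matches this article"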

Environment

Tanzu Hub

Cause

An uncommitted cluster configuration change is the root cause. When scaling ClickHouse Keeper from 3 to 7 nodes, the joint consensus config entry was written to the Raft log by the original leader, but the leader failed before the old quorum could commit it.

This split the cluster into two incompatible quorum configurations: old nodes continued operating under the original smaller config (quorum = majority of the original size, 2 of 3), while new nodes booted with the expanded config (quorum = majority of the new size, 4 of 7).

The new nodes carried only a small fraction of the log entries that the old nodes had accumulated and had no path to receive the old nodes' compacted snapshot. They elected their own stale leader among themselves, entered an irrecoverable log-match deadlock with the old nodes, and began rejecting all Keeper client sessions whose transaction IDs were ahead of the stale state. The resulting continuous leader churn across the two incompatible quorums violated the preprocessing queue LIFO rollback invariant inside ClickHouse Keeper, causing one of the old nodes to crash with a LOGICAL_ERROR (Code: 49) and enter CrashLoopBackOff.
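
The quorum sizes behind this split can be worked out directly. A Raft majority for n voting members is floor(n/2) + 1; the helper name quorum below is only illustrative.

```shell
# Majority quorum for an ensemble of n voting members: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

quorum 3   # original ensemble: prints 2
quorum 7   # expanded ensemble: prints 4
```

Under the original config the three old nodes can still reach a majority of 2 among themselves, while under the expanded config the four new nodes alone already form a majority of 4. Each group can therefore elect a leader without the other.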

Resolution

To resolve this issue, the attached ch-remediate.sh script can be run from a jumphost.

Alternatively, run the steps below manually.

Step 1: SSH to the registry VM

SSH to the registry VM to run kubectl commands. KUBECONFIG with admin access is set by default.

bosh -d <Hub Deployment> ssh registry

Verify connectivity

kubectl cluster-info
kubectl get nodes

Step 2: Pause Carvel Packages

kctrl package installed pause -i ensemble-helm -n tanzusm -y
kctrl package installed pause -i clickhouse-metrics -n tanzusm -y
kctrl package installed pause -i sm -n tanzusm -y

Verify packages are paused

kctrl app list -n tanzusm
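
As a cross-check, the paused flag can also be read from the package install objects. This sketch assumes the three packages are kapp-controller PackageInstall resources with the same names as in the pause commands, and that pausing sets spec.paused to true. The function name check_paused is hypothetical.

```shell
# Report whether each package install has spec.paused set to true.
# Assumption: the packages are PackageInstall resources in namespace tanzusm.
check_paused() {
  for pkg in ensemble-helm clickhouse-metrics sm; do
    paused=$(kubectl -n tanzusm get packageinstall "$pkg" \
      -o jsonpath='{.spec.paused}' 2>/dev/null)
    if [ "$paused" = "true" ]; then
      echo "$pkg: paused"
    else
      echo "$pkg: NOT paused"
    fi
  done
}
```

All three lines should read "paused" before continuing.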

Step 3: Stop clickhouse-metrics (chi)

This will gracefully terminate all ClickHouse pods so PVCs are released and safe to clean.

kubectl -n tanzusm patch chi clickhouse-metrics \
  --type=merge -p '{"spec":{"stop":"yes"}}'

Wait for all pods to terminate (up to 5 minutes)

watch kubectl -n tanzusm get pods \
  -l "clickhouse.altinity.com/chi=clickhouse-metrics"

All pods must disappear.

Note: Do not proceed to the next step until the pod count is 0. If pods are stuck terminating after 5 minutes, investigate with:

kubectl -n tanzusm describe pod <pod-name>
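
Instead of watching the output by hand, the wait can be scripted with a timeout. The following is a sketch only. The function name wait_for_chi_pods_gone and its defaults (300 seconds total, polling every 10 seconds) are assumptions based on the 5-minute guidance above.

```shell
# Poll until no pods match the CHI label, or give up after the timeout.
# Usage: wait_for_chi_pods_gone [timeout_seconds] [poll_interval_seconds]
wait_for_chi_pods_gone() {
  timeout="${1:-300}"
  interval="${2:-10}"
  elapsed=0
  while :; do
    # Count non-empty lines; kubectl prints nothing on stdout when no pods match.
    count=$(kubectl -n tanzusm get pods \
      -l "clickhouse.altinity.com/chi=clickhouse-metrics" \
      --no-headers 2>/dev/null | grep -c .)
    if [ "$count" -eq 0 ]; then
      echo "All ClickHouse pods terminated"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out with $count pod(s) remaining" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

A non-zero return code means pods are still present and the next step should not be started.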

Step 4: Clear Keeper Data

This will remove corrupted or stale Keeper state files (coordination logs, snapshots) so Keeper starts fresh. Table data (data directories) is never touched.
Path being cleaned: /var/vcap/store/pvc-*/clickhouse-keeper/

bosh -d <Hub Deployment> vms | grep clickhouse-metrics

For each clickhouse-metrics VM index (e.g., 0 through 6), run:

# BOSH_DEPLOYMENTS must hold the Hub deployment name used in the bosh commands above
BOSH_DEPLOYMENTS="<Hub Deployment>"

for INDEX in $(seq 0 6); do
  echo "=== Cleaning clickhouse-metrics/$INDEX ==="
  bosh -d "$BOSH_DEPLOYMENTS" ssh "clickhouse-metrics/$INDEX" --command 'found=0
     while IFS= read -r keeper_dir; do
       echo "Clearing: $keeper_dir"
       rm -rf "$keeper_dir"/*
       echo "Cleared."
       found=1
     done < <(find /var/vcap/store -maxdepth 4 -name clickhouse-keeper -type d 2>/dev/null)
     [[ $found -eq 0 ]] && echo "No clickhouse-keeper directory found"
     echo "=== Done ==="'
done

Verify on any one VM

bosh -d "$BOSH_DEPLOYMENTS" ssh "clickhouse-metrics/0" --command 'find /var/vcap/store -maxdepth 4 -name clickhouse-keeper -type d 2>/dev/null | while read d; do echo "--- $d ---"; ls -lh "$d"; done; echo done'

Expected: the directories exist but are empty (no coordination/ or snapshots/ contents).

Step 5: Restart clickhouse-metrics (chi)

Run from the registry VM. This lets the CHI operator start all pods with fresh Keeper state.

kubectl -n tanzusm patch chi clickhouse-metrics \
  --type=merge -p '{"spec":{"stop":"no"}}'

Wait for shards 0-4 to become Ready (up to 8 minutes):

watch kubectl -n tanzusm get pods \
  -l "clickhouse.altinity.com/chi=clickhouse-metrics"

SHARD_COUNT=7
for s in $(seq 0 $((SHARD_COUNT - 1))); do
  echo -n "shard-$s: "
  kubectl -n tanzusm get pod "chi-clickhouse-metrics-default-${s}-0-0" \
    -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "not found"
  echo
done

Step 6: Verify Keeper Leader Election

NAMESPACE="tanzusm"
SHARD_COUNT=7

for s in $(seq 0 $((SHARD_COUNT - 1))); do
  pod="chi-clickhouse-metrics-default-${s}-0-0"
  echo "===== $pod ====="
  kubectl -n "$NAMESPACE" exec "$pod" -c clickhouse -- \
    clickhouse-keeper-client -h 127.0.0.1 -p 2181 -q "mntr" --history-file=/dev/null 2>/dev/null \
    | grep "zk_server_state" | awk '{print "Keeper State: " $2}'
  kubectl -n "$NAMESPACE" exec "$pod" -c clickhouse -- \
    clickhouse-keeper-client -h 127.0.0.1 -p 2181 -q "stat" --history-file=/dev/null 2>/dev/null \
    | grep -E "(Mode:|Node count:)" || echo "Failed"
  echo
done

Expected output: Exactly one shard shows leader, the others show follower.

If no leader is found, wait 10 seconds and retry. Allow up to 3 minutes total.
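
The retry guidance above (every 10 seconds, up to 3 minutes) can be scripted as follows. This is a sketch that reuses the mntr query from this step. The function names count_keeper_leaders and wait_for_keeper_leader are hypothetical.

```shell
# Count shards whose Keeper reports zk_server_state = leader.
count_keeper_leaders() {
  shard_count="${1:-7}"
  leaders=0
  for s in $(seq 0 $((shard_count - 1))); do
    state=$(kubectl -n tanzusm exec "chi-clickhouse-metrics-default-${s}-0-0" \
      -c clickhouse -- clickhouse-keeper-client -h 127.0.0.1 -p 2181 \
      -q "mntr" --history-file=/dev/null 2>/dev/null \
      | awk '$1 == "zk_server_state" { print $2 }')
    if [ "$state" = "leader" ]; then
      leaders=$((leaders + 1))
    fi
  done
  echo "$leaders"
}

# Retry until exactly one leader is seen or the timeout expires.
# Usage: wait_for_keeper_leader [shard_count] [timeout_seconds] [interval_seconds]
wait_for_keeper_leader() {
  shard_count="${1:-7}"
  timeout="${2:-180}"
  interval="${3:-10}"
  elapsed=0
  while :; do
    leaders=$(count_keeper_leaders "$shard_count")
    if [ "$leaders" -eq 1 ]; then
      echo "Exactly one Keeper leader found"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Found $leaders leader(s) after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

A count other than 1 after the timeout (zero leaders, or more than one) indicates that the election has not settled.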

Step 7: Check table count on each shard

SHARD_COUNT=7
CH_PASS=$(kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 \
  -c clickhouse -- bash -c 'echo "$CLICKHOUSE_ADMIN_PASSWORD"' 2>/dev/null | tr -d '\r')


for s in $(seq 0 $((SHARD_COUNT - 1))); do
  pod="chi-clickhouse-metrics-default-${s}-0-0"
  ready=$(kubectl -n tanzusm get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
  if [[ "$ready" != "true" ]]; then
    echo "shard-$s: NOT READY — skip"
    continue
  fi
  count=$(kubectl -n tanzusm exec "$pod" -c clickhouse -- \
    clickhouse-client -u default --password="$CH_PASS" \
    --query="SELECT count() FROM system.tables WHERE database='cdb_hc'" 2>/dev/null || echo "0")
  echo "shard-$s: $count tables"
done

Expected output: Every shard reports the same table count.
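
With seven shards it is easy to overlook a lower count. The sketch below reads the "shard-N: COUNT tables" lines printed by the loop above and lists the shards whose count is below the highest one. The function name shards_missing_tables is hypothetical, and it treats the highest count as the reference.

```shell
# Read "shard-N: COUNT tables" lines on stdin and print the shards whose
# count is lower than the highest count seen. Other lines are ignored.
shards_missing_tables() {
  awk '
    /^shard-[0-9]+: [0-9]+ tables$/ {
      name = $1
      sub(/:$/, "", name)
      count[name] = $2 + 0
      if (count[name] > max) max = count[name]
      order[++n] = name
    }
    END {
      for (i = 1; i <= n; i++)
        if (count[order[i]] < max) print order[i]
    }'
}
```

Save the loop output to a file and pipe it in, for example: shards_missing_tables < /tmp/table_counts.txt (the file name is only an example). Shards reported as not ready are skipped and need to be checked again once they are ready.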

Step 7a: Fix a shard with missing tables

Run this step only if a shard reports fewer tables than the others.

# 1. Fetch all CREATE statements from shard-0 (reference)
REF_POD="chi-clickhouse-metrics-default-0-0-0"
CH_PASS=$(kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 \
  -c clickhouse -- bash -c 'echo "$CLICKHOUSE_ADMIN_PASSWORD"' 2>/dev/null | tr -d '\r')

CREATE_STMTS=$(kubectl -n tanzusm exec "$REF_POD" -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --query="SELECT DISTINCT replaceRegexpOne(
      concat(create_table_query, ';'),
      'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)',
      'CREATE \1 IF NOT EXISTS')
    FROM clusterAllReplicas('default', system.tables)
    WHERE database = 'cdb_hc'
      AND create_table_query != ''
      AND name NOT LIKE '.inner.%'
      AND name NOT LIKE '.inner_id.%'
    ORDER BY multiIf(engine LIKE '%MergeTree', 1, engine='Distributed', 2,
      engine='Dictionary', 3, engine='View', 4, engine='MaterializedView', 5, 6)
    SETTINGS skip_unavailable_shards=1,
             show_table_uuid_in_table_create_query_if_not_nil=1
    FORMAT TSVRaw" 2>/dev/null)

# 2. Replay each statement on the broken shard (replace N with the shard index, usually 5 or 6)
BROKEN_SHARD=N
BROKEN_POD="chi-clickhouse-metrics-default-${BROKEN_SHARD}-0-0"

while IFS= read -r stmt; do
  [[ -z "$stmt" ]] && continue
  kubectl -n tanzusm exec "$BROKEN_POD" -c clickhouse -- \
    clickhouse-client -u default --password="$CH_PASS" \
    --query="$stmt" 2>/dev/null || true
done <<< "$CREATE_STMTS"

# 3. Verify
kubectl -n tanzusm exec "$BROKEN_POD" -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --query="SELECT count() FROM system.tables WHERE database='cdb_hc'"
# Expected: the count matches the other shards

Step 7b: Recreate dictionaries

Run these commands on shards 5 and 6 only.

# Adjust to target specific shards if needed
SHARD_LIST="5 6"
REF_POD="chi-clickhouse-metrics-default-0-0-0"

CREATE_FNS=$(kubectl -n tanzusm exec "$REF_POD" -c clickhouse -- \
  clickhouse-client -u default --password "$CH_PASS" \
  --query="SELECT DISTINCT replaceRegexpOne(concat(create_query, ';'), 'CREATE (FUNCTION)', 'CREATE \1 IF NOT EXISTS') FROM clusterAllReplicas('default', system.functions) WHERE create_query != '' SETTINGS skip_unavailable_shards=1 FORMAT TSVRaw" 2>/dev/null)

for N in $SHARD_LIST; do
  POD="chi-clickhouse-metrics-default-${N}-0-0"
  echo "=== UDFs on shard-$N ==="
  while IFS= read -r stmt; do
    [[ -z "$stmt" ]] && continue
    kubectl -n tanzusm exec "$POD" -c clickhouse -- \
      clickhouse-client -u default --password "$CH_PASS" \
      --query="$stmt" 2>/dev/null || true
  done <<< "$CREATE_FNS"
  echo "shard-$N: UDFs done"
done

cat > /tmp/agg_dict.sql << 'SQLEOF'
DROP DICTIONARY IF EXISTS cdb_hc.agg_dict;
CREATE DICTIONARY IF NOT EXISTS cdb_hc.agg_dict
(
    `__name__` String,
    `__domain__` String,
    `aggregator` String,
    `target_domain` String,
    `enable` Bool
)
PRIMARY KEY __name__, __domain__
SOURCE(CLICKHOUSE(TABLE 'distributed_aggregate_metrics_metadata' USER 'clickhouse' PASSWORD '__CHPASS__'))
LIFETIME(MIN 0 MAX 300)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY());
SQLEOF
# Substitute actual password into the file
sed -i "s/__CHPASS__/${CH_PASS}/g" /tmp/agg_dict.sql

# Apply on each target shard via stdin (kubectl exec -i)
for N in $SHARD_LIST; do
  POD="chi-clickhouse-metrics-default-${N}-0-0"
  echo "=== agg_dict on shard-$N ==="
  kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    < /tmp/agg_dict.sql \
    && echo "shard-$N: done" \
    || echo "shard-$N: FAILED — check error above"
done
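
The sed substitution used for the password in this step breaks if the password contains a slash, an ampersand, or a backslash, because those characters are special in a sed replacement. A minimal sketch of an escaping helper follows. The function name escape_sed_replacement is hypothetical, and it does not handle newlines or single quotes, which would also break the SQL string literal.

```shell
# Escape characters that are special in a sed replacement when "/" is the
# delimiter: the slash, the ampersand, and the backslash.
escape_sed_replacement() {
  printf '%s' "$1" | sed 's#[/&\\]#\\&#g'
}

# Possible use in place of the plain substitution:
# sed -i "s/__CHPASS__/$(escape_sed_replacement "$CH_PASS")/g" /tmp/agg_dict.sql
```

Passwords made up only of letters and digits pass through unchanged, so the helper is safe to use in every case.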

Step 8: Final Verification and Read-Only Replica Fix

Check the table count on every replica across all shards:

kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
  clickhouse-client -u default --password "$CH_PASS" \
  --query="SELECT hostName(), count()
           FROM clusterAllReplicas('default', system.tables)
           WHERE database='cdb_hc'
           GROUP BY 1 ORDER BY 1
           SETTINGS skip_unavailable_shards=1"

Expected: The number of tables is the same on all shards.

Step 8a: Check for read-only replicas

SHARD_COUNT=7
for s in $(seq 0 $((SHARD_COUNT - 1))); do
  pod="chi-clickhouse-metrics-default-${s}-0-0"
  ready=$(kubectl -n tanzusm get pod "$pod" \
    -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
  [[ "$ready" != "true" ]] && echo "  shard-$s: NOT READY" && continue
  ro=$(kubectl -n tanzusm exec "$pod" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT count() FROM system.replicas WHERE is_readonly=1" \
    2>/dev/null || echo "?")
  echo "  shard-$s: $ro read-only table(s)"
done

Expected: Count should be 0 on every shard.

Step 8b: Fix read-only replicas

This step is needed only when a shard reports a read-only table count greater than 0.

SHARD_COUNT=7

for s in $(seq 0 $((SHARD_COUNT - 1))); do
  POD="chi-clickhouse-metrics-default-${s}-0-0"
  ready=$(kubectl -n tanzusm get pod "$POD" \
    -o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
  [[ "$ready" != "true" ]] && echo "shard-$s: NOT READY — skipping" && continue

  ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "0")
  [[ "$ro" == "0" ]] && echo "shard-$s: OK" && continue

  # ── Pass 1: SYSTEM RESTART REPLICA (all tables in one multiquery) ──
  echo "shard-$s: $ro read-only — Pass 1: SYSTEM RESTART REPLICA..."
  kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT concat('SYSTEM RESTART REPLICA ', database, '.', \`table\`, ';') FROM system.replicas WHERE is_readonly=1 FORMAT TSVRaw" 2>/dev/null \
  | kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" --multiquery

  sleep 10
  ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
  [[ "$ro" == "0" ]] && echo "shard-$s: fixed after Pass 1" && continue

  # ── Pass 2: SYSTEM RESTORE REPLICA (only if Pass 1 left tables read-only) ──
  echo "shard-$s: $ro still read-only — Pass 2: SYSTEM RESTORE REPLICA..."
  kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT concat('SYSTEM RESTORE REPLICA ', database, '.', \`table\`, ';') FROM system.replicas WHERE is_readonly=1 AND engine IN ('ReplicatedMergeTree','ReplicatedReplacingMergeTree','ReplicatedAggregatingMergeTree','ReplicatedSummingMergeTree','ReplicatedCollapsingMergeTree','ReplicatedVersionedCollapsingMergeTree','ReplicatedGraphiteMergeTree') FORMAT TSVRaw" 2>/dev/null \
  | kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" --multiquery

  sleep 10
  ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
  echo "shard-$s: final read-only count: $ro"
done

echo ""
echo "=== Summary ==="
for s in $(seq 0 $((SHARD_COUNT - 1))); do
  POD="chi-clickhouse-metrics-default-${s}-0-0"
  ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
    clickhouse-client -u default --password "$CH_PASS" \
    --query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
  echo "  shard-$s: $ro read-only"
done
# Expected: 0 on every shard

Step 8c: Create roles and users if missing
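
To see which roles are actually missing, the role names used in this step can be compared against system.roles on shard 0. This is a sketch. The function name missing_roles is hypothetical, and it assumes CH_PASS is still set from the earlier steps.

```shell
# Print each expected role that is absent from system.roles on shard 0.
# Assumes CH_PASS holds the admin password retrieved in Step 7.
missing_roles() {
  existing=$(kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 \
    -c clickhouse -- clickhouse-client -u default --password="$CH_PASS" \
    --query="SELECT name FROM system.roles FORMAT TSVRaw" 2>/dev/null)
  for role in \
    log_interactive_reader log_background_reader log_writer \
    metric_interactive_reader metric_background_reader metric_writer \
    event_interactive_reader event_background_reader event_writer \
    trace_interactive_reader trace_writer remote_query_reader; do
    # -x matches the whole line, -F treats the role name as a fixed string.
    printf '%s\n' "$existing" | grep -qxF "$role" || echo "$role"
  done
}
```

Empty output means all twelve roles exist. The CREATE ROLE statements use IF NOT EXISTS, so running them is harmless either way.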

Run the following commands on shard 0.

kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --multiquery --query="
    CREATE ROLE IF NOT EXISTS log_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS log_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS log_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS metric_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_background_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS event_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS trace_interactive_reader ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS trace_writer ON CLUSTER 'default';
    CREATE ROLE IF NOT EXISTS remote_query_reader ON CLUSTER 'default';
  "

kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
  clickhouse-client -u default --password="$CH_PASS" \
  --multiquery --query="
    DROP USER IF EXISTS observability_user ON CLUSTER 'default';
    CREATE USER IF NOT EXISTS observability_user ON CLUSTER 'default' IDENTIFIED BY '$CH_PASS';
    DROP USER IF EXISTS observability_remote_query_user ON CLUSTER 'default';
    CREATE USER IF NOT EXISTS observability_remote_query_user ON CLUSTER 'default' IDENTIFIED BY '';
  "

Step 9: Resume Carvel package reconciliation

kctrl package installed kick -i ensemble-helm -n tanzusm -y
kctrl package installed kick -i clickhouse-metrics -n tanzusm -y
kctrl package installed kick -i sm -n tanzusm -y

Verify packages are running

kctrl app list -n tanzusm

Attachments

ch-remediate.sh
