Some ClickHouse pods enter the CrashLoopBackOff state when scaling from 3 to 7 shards
How to Check for the Issue
Step 1: Verify ClickHouse pod status
kubectl get pods -n tanzusm -l app=clickhouse-op -o wide
Example output
chi-clickhouse-metrics-default-0-0-0 1/1 Running 6 (12m ago) 25m
chi-clickhouse-metrics-default-1-0-0 0/1 CrashLoopBackOff 8 (59s ago) 106m
chi-clickhouse-metrics-default-2-0-0 0/1 CrashLoopBackOff 8 (61s ago) 24m
chi-clickhouse-metrics-default-3-0-0 1/1 Running 0 24m
chi-clickhouse-metrics-default-4-0-0 1/1 Running 0 24m
chi-clickhouse-metrics-default-5-0-0 1/1 Running 0 22m
chi-clickhouse-metrics-default-6-0-0 1/1 Running 0 22m
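To pull only the failing pods out of a listing like the one above, the STATUS column can be filtered with awk. This is a minimal sketch: the helper name `crashloop_pods` is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
# Hypothetical helper: print the names of pods whose STATUS is CrashLoopBackOff.
# Reads `kubectl get pods` output on stdin; assumes NAME, READY, STATUS are
# the first three whitespace-separated columns.
crashloop_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Example (requires cluster access):
#   kubectl get pods -n tanzusm -l app=clickhouse-op | crashloop_pods
```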
Step 2: Check the logs of the ClickHouse pods that are in CrashLoopBackOff
kubectl logs -n tanzusm chi-clickhouse-metrics-default-1-0-0
The following error will be observed:
2026.05.01 20:19:40.193488 [ 636 ] {} <Error> RaftInstance: session 67789 failed to process request message due to error: Trying to rollback invalid ZXID (70383435). It should be the last preprocessed.
2026.05.01 20:19:40.193664 [ 636 ] {} <Error> ForcedCriticalErrorsLogger: Code: 49. DB::Exception: Trying to rollback invalid ZXID (70383435). It should be the last preprocessed. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):
0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x00000000166dc66a
1. DB::Exception::Exception(String&&, int, String, bool) @ 0x000000000e55bf0e
2. DB::Exception::Exception(PreformattedMessage&&, int) @ 0x000000000e55b849
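When the log is long, the signature can be searched for directly. The sketch below extracts the distinct ZXID values reported in "Trying to rollback invalid ZXID" messages; the function name `rollback_zxids` is hypothetical and the pattern is taken from the log lines shown above.

```shell
# Hypothetical helper: print each distinct ZXID reported in
# "Trying to rollback invalid ZXID (<n>)" messages. Reads pod logs on stdin.
# Prints nothing if the error is not present.
rollback_zxids() {
  grep -o 'Trying to rollback invalid ZXID ([0-9]*)' \
    | grep -o '[0-9][0-9]*' \
    | sort -u
}

# Example (requires cluster access):
#   kubectl logs -n tanzusm chi-clickhouse-metrics-default-1-0-0 | rollback_zxids
```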
Tanzu Hub
The root cause is an uncommitted cluster configuration change. When ClickHouse Keeper was scaled from 3 to 7 nodes, the original leader wrote the joint consensus config entry to the Raft log but failed before the old quorum could commit it.
This split the cluster into two incompatible quorum worlds: the old nodes continued operating under the original, smaller config (quorum = a majority of the original size, 2 of 3), while the new nodes booted with the expanded config (quorum = a majority of the new size, 4 of 7).
The new nodes carried only a small fraction of the log entries that the old nodes had accumulated and had no path to receive the old nodes' compacted snapshot. They elected their own stale leader among themselves, entered an irrecoverable log-match deadlock with the old nodes, and began rejecting all Keeper client sessions whose transaction IDs were ahead of the stale state. The resulting continuous leader churn across the two incompatible quorums violated the LIFO rollback invariant of the preprocessing queue inside ClickHouse Keeper, causing one of the old nodes to crash with a LOGICAL_ERROR (Code: 49) and enter CrashLoopBackOff.
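The quorum sizes involved follow from the usual Raft majority rule, floor(N/2) + 1. The sketch below only illustrates that arithmetic; the function name `quorum` is hypothetical.

```shell
# Majority quorum for a Raft group of N voting members: floor(N/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

for n in 3 7; do
  echo "members=$n quorum=$(quorum "$n")"
done
# prints:
#   members=3 quorum=2
#   members=7 quorum=4
```

With 7 configured members the quorum is 4, which is consistent with the four newly added nodes being able to elect a leader among themselves.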
To resolve this issue, run the attached ch-remediate.sh script from a jumphost.
Alternatively, run the steps outlined below manually.
Step 1: SSH to the registry VM
SSH to the registry VM to run kubectl commands. KUBECONFIG with admin access is set by default.
bosh -d <Hub Deployment> ssh registry
Verify connectivity:
kubectl cluster-info
kubectl get nodes -n tanzusm
Step 2: Pause Carvel Packages
kctrl package installed pause -i ensemble-helm -n tanzusm -y
kctrl package installed pause -i clickhouse-metrics -n tanzusm -y
kctrl package installed pause -i sm -n tanzusm -y
Verify packages are paused:
kctrl app list -n tanzusm
Step 3: Stop clickhouse-metrics (chi)
This will gracefully terminate all ClickHouse pods so PVCs are released and safe to clean.
kubectl -n tanzusm patch chi clickhouse-metrics \
--type=merge -p '{"spec":{"stop":"yes"}}'
Wait for all pods to terminate (up to 5 minutes):
watch kubectl -n tanzusm get pods \
-l "clickhouse.altinity.com/chi=clickhouse-metrics"
All pods must disappear.
Note: Do not proceed to the next step until the pod count is 0. If pods are stuck terminating after 5 minutes, investigate with:
kubectl -n tanzusm describe pod <pod-name>
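As an alternative to watching interactively, the wait can be scripted with a timeout. This is a sketch only: `wait_until_zero` and `count_chi_pods` are hypothetical helper names, and the 300-second timeout simply mirrors the 5 minutes mentioned above.

```shell
# wait_until_zero TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Runs COMMAND repeatedly until it prints "0" (returns 0) or the timeout
# expires (returns 1).
wait_until_zero() {
  _timeout=$1; _interval=$2; shift 2
  _deadline=$(( $(date +%s) + _timeout ))
  while :; do
    _count=$("$@" 2>/dev/null) || _count="?"
    [ "$_count" = "0" ] && return 0
    if [ "$(date +%s)" -ge "$_deadline" ]; then
      echo "timed out, last count: $_count" >&2
      return 1
    fi
    sleep "$_interval"
  done
}

# Counts the remaining pods of the clickhouse-metrics CHI (requires cluster access).
count_chi_pods() {
  kubectl -n tanzusm get pods \
    -l "clickhouse.altinity.com/chi=clickhouse-metrics" --no-headers 2>/dev/null \
    | wc -l | tr -d ' '
}

# Example: wait_until_zero 300 5 count_chi_pods || echo "pods still present"
```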
Step 4: Clear Keeper Data
This will remove corrupted or stale Keeper state files (coordination logs, snapshots) so Keeper starts fresh. Table data (data directories) is never touched.
Path being cleaned: /var/vcap/store/pvc-*/clickhouse-keeper/
bosh -d <Hub Deployment> vms | grep clickhouse-metrics
For each clickhouse-metrics VM index (e.g., 0 through 6), run the following, with BOSH_DEPLOYMENTS set to the Hub deployment name:
for INDEX in $(seq 0 6); do
echo "=== Cleaning clickhouse-metrics/$INDEX ==="
bosh -d $BOSH_DEPLOYMENTS ssh "clickhouse-metrics/$INDEX" --command 'found=0
while IFS= read -r keeper_dir; do
echo "Clearing: $keeper_dir"
rm -rf "$keeper_dir"/*
echo "Cleared."
found=1
done < <(find /var/vcap/store -maxdepth 4 -name clickhouse-keeper -type d 2>/dev/null)
[[ $found -eq 0 ]] && echo "No clickhouse-keeper directory found"
echo "=== Done ==="'
done
Verify on any one VM:
bosh -d $BOSH_DEPLOYMENTS ssh "clickhouse-metrics/0" --command 'find /var/vcap/store -maxdepth 4 -name clickhouse-keeper -type d 2>/dev/null | while read d; do echo "--- $d ---"; ls -lh "$d"; done; echo done'
The directories should still exist but be empty (no coordination/ or snapshots/ files).
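The emptiness check can also be expressed as a small test rather than read from `ls` output. The helper below is a hypothetical sketch meant to run on the VM itself. It uses `ls -A`, so it also reports dotfiles, which a `*` glob does not match by default and an `rm -rf dir/*` would therefore leave behind.

```shell
# Hypothetical helper: succeed only if DIR exists and contains no entries,
# including hidden ones.
keeper_dir_is_empty() {
  [ -d "$1" ] && [ -z "$(ls -A "$1" 2>/dev/null)" ]
}

# Example, using the same find expression as the cleanup step:
#   find /var/vcap/store -maxdepth 4 -name clickhouse-keeper -type d 2>/dev/null |
#     while IFS= read -r d; do
#       keeper_dir_is_empty "$d" && echo "$d: empty" || echo "$d: NOT empty"
#     done
```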
Step 5: Start clickhouse-metrics (chi)
Run from the registry VM. This lets the CHI operator start all pods with fresh Keeper state.
kubectl -n tanzusm patch chi clickhouse-metrics \
--type=merge -p '{"spec":{"stop":"no"}}'
Wait for shards 0-4 to become Ready (up to 8 minutes):
watch kubectl -n tanzusm get pods \
-l "clickhouse.altinity.com/chi=clickhouse-metrics"
SHARD_COUNT=7
for s in $(seq 0 $((SHARD_COUNT - 1))); do
echo -n "shard-$s: "
kubectl -n tanzusm get pod "chi-clickhouse-metrics-default-${s}-0-0" \
-o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "not found"
echo
done
Step 6: Verify Keeper Leader Election
NAMESPACE="tanzusm"; SHARD_COUNT=7;
for s in $(seq 0 $((SHARD_COUNT - 1))); do
  pod="chi-clickhouse-metrics-default-${s}-0-0"
  echo "===== $pod ====="
  kubectl -n "$NAMESPACE" exec "$pod" -c clickhouse -- \
    clickhouse-keeper-client -h 127.0.0.1 -p 2181 -q "mntr" --history-file=/dev/null 2>/dev/null \
    | grep "zk_server_state" | awk '{print "Keeper State: " $2}'
  kubectl -n "$NAMESPACE" exec "$pod" -c clickhouse -- \
    clickhouse-keeper-client -h 127.0.0.1 -p 2181 -q "stat" --history-file=/dev/null 2>/dev/null \
    | grep -E "(Mode:|Node count:)" || echo "Failed"
  echo
done
Expected output: Exactly one shard shows leader; the others show follower.
If no leader is found, wait 10 seconds and retry. Allow up to 3 minutes total.
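The "exactly one leader" condition can be checked mechanically by counting `zk_server_state` lines across the collected `mntr` outputs. This is a sketch under the assumption that `mntr` prints whitespace-separated key/value pairs, as the awk `$2` extraction in the command above implies; `count_leaders` is a hypothetical name.

```shell
# Hypothetical helper: read concatenated `mntr` outputs on stdin and print
# how many of them report zk_server_state = leader.
count_leaders() {
  awk '$1 == "zk_server_state" && $2 == "leader" { n++ } END { print n + 0 }'
}

# Example: collect mntr output from every shard into mntr_all.txt
# (a hypothetical file name), then run:
#   [ "$(count_leaders < mntr_all.txt)" = "1" ] && echo "leader elected"
```

A result of 0 means no leader yet (retry as described above). A result greater than 1 would suggest the Keeper nodes are still partitioned.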
Step 7: Check table count on each shard
SHARD_COUNT=7
CH_PASS=$(kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 \
-c clickhouse -- bash -c 'echo "$CLICKHOUSE_ADMIN_PASSWORD"' 2>/dev/null | tr -d '\r')
for s in $(seq 0 $((SHARD_COUNT - 1))); do
pod="chi-clickhouse-metrics-default-${s}-0-0"
ready=$(kubectl -n tanzusm get pod "$pod" \
-o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
if [[ "$ready" != "true" ]]; then
echo "shard-$s: NOT READY — skip"
continue
fi
count=$(kubectl -n tanzusm exec "$pod" -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--query="SELECT count() FROM system.tables WHERE database='cdb_hc'" 2>/dev/null || echo "0")
echo "shard-$s: $count tables"
done
Expected output: Every shard should report the same number of tables.
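To spot the odd shard out without comparing numbers by eye, the loop output can be piped through a filter that prints every shard whose count is below the maximum. A minimal sketch, assuming the `shard-N: <count> tables` line format produced by the loop above; `lagging_shards` is a hypothetical name.

```shell
# Hypothetical helper: read "shard-N: <count> tables" lines on stdin and print
# the shards whose count is lower than the highest count seen.
# Lines in any other shape (for example NOT READY lines) are ignored.
lagging_shards() {
  awk '
    $3 == "tables" && $2 ~ /^[0-9]+$/ {
      n++; name[n] = $1; cnt[n] = $2 + 0
      if (cnt[n] > max) max = cnt[n]
    }
    END {
      for (i = 1; i <= n; i++)
        if (cnt[i] < max) { sub(/:$/, "", name[i]); print name[i] }
    }
  '
}
```

No output means all ready shards agree. Any shard printed is a candidate for Step 7a.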
Step 7a: Fix a shard with missing tables
Run this step only if a shard reports fewer tables than the others.
# 1. Fetch all CREATE statements from shard-0 (reference)
REF_POD="chi-clickhouse-metrics-default-0-0-0"
CH_PASS=$(kubectl -n tanzusm exec "$REF_POD" \
-c clickhouse -- bash -c 'echo "$CLICKHOUSE_ADMIN_PASSWORD"' 2>/dev/null | tr -d '\r')
CREATE_STMTS=$(kubectl -n tanzusm exec "$REF_POD" -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--query="SELECT DISTINCT replaceRegexpOne(
concat(create_table_query, ';'),
'CREATE (TABLE|VIEW|MATERIALIZED VIEW|DICTIONARY|LIVE VIEW|WINDOW VIEW)',
'CREATE \1 IF NOT EXISTS')
FROM clusterAllReplicas('default', system.tables)
WHERE database = 'cdb_hc'
AND create_table_query != ''
AND name NOT LIKE '.inner.%'
AND name NOT LIKE '.inner_id.%'
ORDER BY multiIf(engine LIKE '%MergeTree', 1, engine='Distributed', 2,
engine='Dictionary', 3, engine='View', 4, engine='MaterializedView', 5, 6)
SETTINGS skip_unavailable_shards=1,
show_table_uuid_in_table_create_query_if_not_nil=1
FORMAT TSVRaw" 2>/dev/null)
# 2. Replay each statement on the broken shard (replace N with the shard index, typically 5 or 6)
BROKEN_SHARD=N
BROKEN_POD="chi-clickhouse-metrics-default-${BROKEN_SHARD}-0-0"
while IFS= read -r stmt; do
[[ -z "$stmt" ]] && continue
kubectl -n tanzusm exec "$BROKEN_POD" -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--query="$stmt" 2>/dev/null || true
done <<< "$CREATE_STMTS"
# 3. Verify
kubectl -n tanzusm exec "$BROKEN_POD" -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--query="SELECT count() FROM system.tables WHERE database='cdb_hc'"
# Expected: Numbers should be same on all the shards
Step 7b: Recreate dictionaries
Run these commands only on shards 5 and 6.
# Adjust to target specific shards if needed
SHARD_LIST="5 6"
REF_POD="chi-clickhouse-metrics-default-0-0-0"
CREATE_FNS=$(kubectl -n tanzusm exec "$REF_POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT DISTINCT replaceRegexpOne(concat(create_query, ';'), 'CREATE (FUNCTION)', 'CREATE \1 IF NOT EXISTS') FROM clusterAllReplicas('default', system.functions) WHERE create_query != '' SETTINGS skip_unavailable_shards=1 FORMAT TSVRaw" 2>/dev/null)
for N in $SHARD_LIST; do
POD="chi-clickhouse-metrics-default-${N}-0-0"
echo "=== UDFs on shard-$N ==="
while IFS= read -r stmt; do
[[ -z "$stmt" ]] && continue
kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="$stmt" 2>/dev/null || true
done <<< "$CREATE_FNS"
echo "shard-$N: UDFs done"
done
cat > /tmp/agg_dict.sql << 'SQLEOF'
DROP DICTIONARY IF EXISTS cdb_hc.agg_dict;
CREATE DICTIONARY IF NOT EXISTS cdb_hc.agg_dict
(
`__name__` String,
`__domain__` String,
`aggregator` String,
`target_domain` String,
`enable` Bool
)
PRIMARY KEY __name__, __domain__
SOURCE(CLICKHOUSE(TABLE 'distributed_aggregate_metrics_metadata' USER 'clickhouse' PASSWORD '__CHPASS__'))
LIFETIME(MIN 0 MAX 300)
LAYOUT(COMPLEX_KEY_HASHED_ARRAY());
SQLEOF
# Substitute the actual password into the file
sed -i "s/__CHPASS__/${CH_PASS}/g" /tmp/agg_dict.sql
# Apply on each target shard via stdin (kubectl exec -i)
for N in $SHARD_LIST; do
POD="chi-clickhouse-metrics-default-${N}-0-0"
echo "=== agg_dict on shard-$N ==="
kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
< /tmp/agg_dict.sql \
&& echo "shard-$N: done" \
|| echo "shard-$N: FAILED — check error above"
done
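One caveat with the `__CHPASS__` substitution above: a password containing `/`, `&`, or a backslash is interpreted by sed rather than inserted literally. The sketch below escapes those characters first. Both function names are hypothetical. It writes to stdout instead of editing in place, and it does not handle a single quote inside the password, which would still break the SQL string literal.

```shell
# Escape the characters that are special in a sed replacement:
# backslash, '/' (the delimiter used below), and '&'.
sed_escape_replacement() {
  printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

# render_sql TEMPLATE_FILE PASSWORD
# Prints TEMPLATE_FILE with every __CHPASS__ placeholder replaced by PASSWORD.
render_sql() {
  _esc=$(sed_escape_replacement "$2")
  sed "s/__CHPASS__/${_esc}/g" "$1"
}

# Example (agg_dict.tmpl.sql is a hypothetical file still containing __CHPASS__):
#   render_sql /tmp/agg_dict.tmpl.sql "$CH_PASS" > /tmp/agg_dict.sql
```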
Step 8: Final Verification and Read-Only Replica Fix
Check the table count on every host with a single cluster-wide query:
kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT hostName(), count()
FROM clusterAllReplicas('default', system.tables)
WHERE database='cdb_hc'
GROUP BY 1 ORDER BY 1
SETTINGS skip_unavailable_shards=1"
Expected: The number of tables should be the same on every host.
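Because `clickhouse-client` prints tab-separated rows when run non-interactively, the per-host result can be checked for agreement with a short filter. A sketch only; `all_counts_equal` is a hypothetical name, and it treats empty input as a failure.

```shell
# Hypothetical helper: read "<host>\t<count>" rows on stdin and succeed only
# if every row carries the same count (and there is at least one row).
all_counts_equal() {
  awk -F '\t' '
    NF >= 2 { seen[$2] = 1 }
    END { n = 0; for (k in seen) n++; exit (n == 1 ? 0 : 1) }
  '
}

# Example: save the query output to counts.tsv (a hypothetical file name), then:
#   all_counts_equal < counts.tsv && echo "consistent" || echo "MISMATCH"
```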
Step 8a: Check for read-only replicas
SHARD_COUNT=7
for s in $(seq 0 $((SHARD_COUNT - 1))); do
pod="chi-clickhouse-metrics-default-${s}-0-0"
ready=$(kubectl -n tanzusm get pod "$pod" \
-o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
[[ "$ready" != "true" ]] && echo " shard-$s: NOT READY" && continue
ro=$(kubectl -n tanzusm exec "$pod" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT count() FROM system.replicas WHERE is_readonly=1" \
2>/dev/null || echo "?")
echo " shard-$s: $ro read-only table(s)"
done
Expected: The count should be 0 on every shard.
Step 8b: Fix read-only replicas
This step is needed only when a shard reports a non-zero number of read-only tables.
SHARD_COUNT=7
for s in $(seq 0 $((SHARD_COUNT - 1))); do
POD="chi-clickhouse-metrics-default-${s}-0-0"
ready=$(kubectl -n tanzusm get pod "$POD" \
-o jsonpath='{.status.containerStatuses[0].ready}' 2>/dev/null || echo "false")
[[ "$ready" != "true" ]] && echo "shard-$s: NOT READY — skipping" && continue
ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "0")
[[ "$ro" == "0" ]] && echo "shard-$s: OK" && continue
# ── Pass 1: SYSTEM RESTART REPLICA (all tables in one multiquery) ──
echo "shard-$s: $ro read-only — Pass 1: SYSTEM RESTART REPLICA..."
kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT concat('SYSTEM RESTART REPLICA ', database, '.', \`table\`, ';') FROM system.replicas WHERE is_readonly=1 FORMAT TSVRaw" 2>/dev/null \
| kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" --multiquery
sleep 10
ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
[[ "$ro" == "0" ]] && echo "shard-$s: fixed after Pass 1" && continue
# ── Pass 2: SYSTEM RESTORE REPLICA (only if Pass 1 left tables read-only) ──
echo "shard-$s: $ro still read-only — Pass 2: SYSTEM RESTORE REPLICA..."
kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT concat('SYSTEM RESTORE REPLICA ', database, '.', \`table\`, ';') FROM system.replicas WHERE is_readonly=1 AND engine IN ('ReplicatedMergeTree','ReplicatedReplacingMergeTree','ReplicatedAggregatingMergeTree','ReplicatedSummingMergeTree','ReplicatedCollapsingMergeTree','ReplicatedVersionedCollapsingMergeTree','ReplicatedGraphiteMergeTree') FORMAT TSVRaw" 2>/dev/null \
| kubectl -n tanzusm exec -i "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" --multiquery
sleep 10
ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
echo "shard-$s: final read-only count: $ro"
done
echo ""
echo "=== Summary ==="
for s in $(seq 0 $((SHARD_COUNT - 1))); do
POD="chi-clickhouse-metrics-default-${s}-0-0"
ro=$(kubectl -n tanzusm exec "$POD" -c clickhouse -- \
clickhouse-client -u default --password "$CH_PASS" \
--query="SELECT count() FROM system.replicas WHERE is_readonly=1" 2>/dev/null || echo "?")
echo " shard-$s: $ro read-only"
done
# Expected: 0 on every shard
Step 8c: Create roles and users if missing
Run the commands below on shard 0.
kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--multiquery --query="
CREATE ROLE IF NOT EXISTS log_interactive_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS log_background_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS log_writer ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS metric_interactive_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS metric_background_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS metric_writer ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS event_interactive_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS event_background_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS event_writer ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS trace_interactive_reader ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS trace_writer ON CLUSTER 'default';
CREATE ROLE IF NOT EXISTS remote_query_reader ON CLUSTER 'default';
"
kubectl -n tanzusm exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -- \
clickhouse-client -u default --password="$CH_PASS" \
--multiquery --query="
DROP USER IF EXISTS observability_user ON CLUSTER 'default';
CREATE USER IF NOT EXISTS observability_user ON CLUSTER 'default' IDENTIFIED BY '$CH_PASS';
DROP USER IF EXISTS observability_remote_query_user ON CLUSTER 'default';
CREATE USER IF NOT EXISTS observability_remote_query_user ON CLUSTER 'default' IDENTIFIED BY '';
"
Step 9: Resume Carvel package reconciliation
kctrl package installed kick -i ensemble-helm -n tanzusm -y
kctrl package installed kick -i clickhouse-metrics -n tanzusm -y
kctrl package installed kick -i sm -n tanzusm -y
Verify packages are running:
kctrl app list -n tanzusm