ClickHouse fails with TOO_MANY_UNEXPECTED_DATA_PARTS

Products

VMware Tanzu Platform - Hub

Issue/Introduction

The ensemble-observability-store pod is in CrashLoopBackOff, and the ClickHouse server logs show the following errors:

Code: 231. DB::Exception: Suspiciously many (N parts, 0.00 B in total) broken parts to remove while maximum allowed broken parts count is 100.
Cannot attach table `<database>`.`<table>` from metadata file ...(TOO_MANY_UNEXPECTED_DATA_PARTS)

Because ClickHouse fails to attach one or more tables, dependent services such as ensemble-observability-store enter CrashLoopBackOff even though the ClickHouse server itself starts.
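To pick the broken-part count (the N in the error above) out of a long log, a small filter can help. The following is a minimal sketch rather than part of the official procedure: the function name broken_parts_count is hypothetical, and the pod and container names in the usage comment are the ones used in the table-attachment check later in this article, so adjust them if your installation differs.

```shell
# Print the broken-part count from every matching ClickHouse log line on stdin.
# Lines that do not contain the "Suspiciously many (N parts" message are ignored.
broken_parts_count() {
  sed -n 's/.*Suspiciously many (\([0-9][0-9]*\) parts.*/\1/p'
}

# Usage (assumed pod/container names; not executed by this snippet):
# kubectl logs chi-clickhouse-metrics-default-0-0-0 -c clickhouse -n tanzusm \
#   | broken_parts_count | sort -n | tail -1
```

The largest value printed is the number that the threshold has to accommodate for the affected table to attach.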

Environment

In Tanzu Hub, ClickHouse is managed through the following stack:

PackageInstall (pkgi): clickhouse-metrics in namespace tanzusm
ClickHouseInstallation CR (CHI): managed by clickhouse-operator (Altinity)
Values Secrets: clickhouse-metrics-values-ver-N and clickhouse-secret-patch
Filesystem: Read-only in pods, so configuration files cannot be edited directly

Cause

ClickHouse has a safety threshold (max_suspicious_broken_parts, default 100) that prevents table attachment when too many broken data parts are detected. This is a safeguard against silent data loss. When the number of broken parts exceeds this threshold, the server refuses to attach the affected table, which can cascade into dependent service failures.

Common causes of excessive broken parts:

Unclean pod shutdowns or node failures during active writes/merges
Disk I/O errors or underlying storage issues on PersistentVolumes
Out-of-disk conditions during background merge operations
Replication lag combined with aggressive TTL cleanup

Resolution

1) Identify the name of the CHI resource

kubectl get chi -n tanzusm -o name
# Output: clickhouseinstallation.clickhouse.altinity.com/clickhouse-metrics

2) Verify the current setting

kubectl get chi clickhouse-metrics -n tanzusm -o jsonpath='{.spec.configuration.settings}' | jq .

3) Pause the sm and clickhouse-metrics pkgi resources

kctrl package installed pause -i sm -n tanzusm --yes
kctrl package installed pause -i clickhouse-metrics -n tanzusm --yes

4) Patch the CHI with a higher threshold (200 in this example; the value must be at least the N reported in the error)

kubectl patch chi clickhouse-metrics -n tanzusm --type json \
  -p '[{"op":"add",
       "path":"/spec/configuration/settings/merge_tree~1max_suspicious_broken_parts",
       "value":"200"}]'
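If the number of broken parts reported in the error is above 200, the same patch can be generated for a different value. The sketch below is illustrative only: build_patch is a hypothetical helper name, and it assumes the same CHI path as the patch above. The "~1" in the path is the JSON Pointer escape for "/", so the key being set is merge_tree/max_suspicious_broken_parts.

```shell
# Build the JSON Patch document for a given threshold (first argument).
# "~1" is the JSON Pointer (RFC 6901) escape for "/", so the settings key
# written to the CHI is "merge_tree/max_suspicious_broken_parts".
build_patch() {
  printf '[{"op":"add","path":"/spec/configuration/settings/merge_tree~1max_suspicious_broken_parts","value":"%s"}]\n' "$1"
}

# Usage (not executed by this snippet):
# kubectl patch chi clickhouse-metrics -n tanzusm --type json -p "$(build_patch 300)"
```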

5) Verify the setting was applied to the CHI

kubectl get chi clickhouse-metrics -n tanzusm -o jsonpath='{.spec.configuration.settings}' | jq .

6) Restart ClickHouse pods

kubectl delete pod -n tanzusm -l clickhouse.altinity.com/chi=clickhouse-metrics

7) Verify the runtime setting

Once the pod has restarted, validate that the setting is applied by exec'ing into the pod and querying ClickHouse.

Note: Authentication is required. Retrieve the password from the following secret:

kubectl get secret clickhouse-secret -n tanzusm -o jsonpath='{.data.password}' | base64 -d
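With the password in hand, the effective value can be read from the system.merge_tree_settings table. This is a sketch under assumptions, not a command taken from the original procedure: check_setting and SETTING_QUERY are hypothetical names, and the pod name, container name, and user are the ones used in the table-attachment check in the next step.

```shell
# SQL that reads the effective MergeTree setting from the running server.
SETTING_QUERY="SELECT name, value, changed FROM system.merge_tree_settings WHERE name = 'max_suspicious_broken_parts'"

# Run the query inside the ClickHouse pod. $1 is the password retrieved above.
# Pod, container, and user names are assumed from the next step of this article.
check_setting() {
  kubectl exec chi-clickhouse-metrics-default-0-0-0 -c clickhouse -n tanzusm -- \
    clickhouse-client --user clickhouse --password "$1" --query "$SETTING_QUERY"
}

# Usage (not executed by this snippet):
# check_setting '<PASSWORD>'
```

If the patch took effect, the value column should show the new threshold and changed should be 1.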

8) Verify table attachment

kubectl exec -it chi-clickhouse-metrics-default-0-0-0 \
  -c clickhouse -n tanzusm -- clickhouse-client \
  --user clickhouse --password '<PASSWORD>' \
  --query "SELECT count() FROM <database>.<table>"

9) Restart dependent services

kubectl rollout restart deployment ensemble-observability-store -n tanzusm
kubectl rollout restart deployment ensemble-ui -n tanzusm

Verify the CrashLoopBackOff pods recover:

kubectl get pods -n tanzusm | grep ensemble
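To narrow that listing down to the pods that still need attention, the output can be filtered. The helper below is a hypothetical sketch (unhealthy_ensemble_pods is not an existing tool); it assumes the default kubectl get pods column order of NAME, READY, STATUS.

```shell
# Read `kubectl get pods` output on stdin and print the names of ensemble pods
# that are not Running or whose READY count is incomplete (e.g. 0/1).
unhealthy_ensemble_pods() {
  awk '$1 ~ /^ensemble/ {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage (not executed by this snippet):
# kubectl get pods -n tanzusm | unhealthy_ensemble_pods
```

An empty result suggests that all ensemble pods are running and ready.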

10) Once the pods are up and running, resume the package reconciliation

kctrl package installed kick -i sm -n tanzusm --yes
kctrl package installed kick -i clickhouse-metrics -n tanzusm --yes