Tanzu Hub Upgrade - ClickHouse with 3 or more shards enter pending state during upgrade due to missing node labels

Article ID: 427065


Products

VMware Tanzu Platform - Hub

Issue/Introduction

ClickHouse pods remain in the Pending state after a node replacement or other infrastructure change because they cannot be scheduled on the nodes where their Persistent Volumes (PVs) are bound.
The platform.tanzu.vmware.com/node label is missing or incorrect on the affected worker nodes.

 

How to check for the issue.

Step 1: Verify ClickHouse pod status

# kubectl get pods -n tanzusm -l app=clickhouse-op -o wide
chi-clickhouse-metrics-default-0-0-0                 0/1     Pending     0               1d
chi-clickhouse-metrics-default-1-0-0                 0/1     Pending     0               1d
chi-clickhouse-metrics-default-2-0-0                 0/1     Pending     0               1d

 

Step 2: Check pod events for scheduling failures

# kubectl describe pod <clickhouse-pod-name> -n tanzusm | grep -A 20 "Events:"
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  9m30s (x25 over 129m)  default-scheduler  0/20 nodes are available: 1 node(s) had untolerated taint {platform.tanzu.vmware.com/service: blobstore}, 1 node(s) had untolerated taint {platform.tanzu.vmware.com/service: prometheus}, 1 node(s) had volume node affinity conflict, 11 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {platform.tanzu.vmware.com/service: kafka}, 3 node(s) had untolerated taint {platform.tanzu.vmware.com/service: postgres}. preemption: 0/20 nodes are available: 20 Preemption is not helpful for scheduling.

Look for events like the following (a namespace-wide shortcut for listing these events is shown after the list):

  • FailedScheduling
  • Node(s) didn't match Pod's node affinity/selector
  • Volume node affinity conflict
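
As an optional shortcut, FailedScheduling events for all pods in the namespace can be listed in a single query; this is a generic kubectl command and is not specific to ClickHouse:

# List recent FailedScheduling events in the namespace, newest last
kubectl get events -n tanzusm --field-selector reason=FailedScheduling --sort-by=.lastTimestamp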

Environment

Tanzu Hub 10.3.3

Cause

The issue occurs due to a mismatch between Kubernetes node labels and ClickHouse Persistent Volume (PV) locality assignments.

Root Cause Details

  1. PV Node Binding: ClickHouse uses local persistent volumes that are bound to specific nodes via the local.path.provisioner/selected-node annotation. This ensures data locality - each ClickHouse replica's data resides on a specific node.

  2. Node Labeling for Scheduling: ClickHouse StatefulSet pods use node affinity/selectors based on the label platform.tanzu.vmware.com/node=clickhouse-metrics-<index> to ensure pods are scheduled on the nodes where their data volumes exist (see the inspection commands after this list).
  3. Label Mismatch Scenarios:
    • Node replacement: When a node is replaced (maintenance, failure), the new node does not automatically inherit the ClickHouse-specific labels
    • Worker node restarts: When worker VMs are restarted using bosh commands
    • Manual intervention: Accidental removal or modification of node labels
    • Upgrade/migration: Platform upgrades that do not preserve custom node labels
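
Both sides of this contract can be inspected directly. A minimal sketch, using placeholder names for the pod and PV (substitute the names from your environment):

# Show where the pod's spec references the platform.tanzu.vmware.com/node label (node selector or node affinity)
kubectl get pod <clickhouse-pod-name> -n tanzusm -o yaml | grep -B 2 -A 6 'platform.tanzu.vmware.com/node'

# Show the node the PV was provisioned on
kubectl get pv <pv-name> -o jsonpath='{.metadata.annotations.local\.path\.provisioner/selected-node}{"\n"}'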

Resolution

Step 1: SSH to the registry VM

SSH to the registry VM to run kubectl commands. A KUBECONFIG with admin access is set by default.

bosh -d <Hub Deployment> ssh registry

Verify connectivity

kubectl cluster-info
kubectl get nodes

 

Step 2: Identify the PV-to-node mapping

Get all ClickHouse PVs with their assigned nodes and claim names

# kubectl get pv -o custom-columns=NAME:.metadata.name,NODE:.metadata.annotations.'local\.path\.provisioner/selected-node',CLAIM:.spec.claimRef.name --no-headers | grep clickhouse
pvc-########-####-####-####-34f02d005d47   ###.###.###.73    data-volume-claim-chi-clickhouse-metrics-default-0-0-0
pvc-########-####-####-####-b105269af3f5   ###.###.###.89    data-volume-claim-chi-clickhouse-metrics-default-1-0-0
pvc-########-####-####-####-85e03280e858   ###.###.###.90    data-volume-claim-chi-clickhouse-metrics-default-2-0-0

 

Step 3: Understand the claim naming pattern and extract shard index

The claim naming pattern is:

data-volume-claim-chi-clickhouse-metrics-default-X-Y-Z

Where X (the first number in the X-Y-Z suffix) is the shard index and Y is the replica index.
We are looking at X-0-0, meaning replica 0 of each shard; a sketch that derives these values automatically follows the table below.

Claim Pattern        Shard Index   Required Label Value
...-default-0-0-0    0             clickhouse-metrics-0
...-default-1-0-0    1             clickhouse-metrics-1
...-default-2-0-0    2             clickhouse-metrics-2
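
The mapping can also be printed in one pass. A minimal sketch, assuming the claim naming pattern above and a single replica per shard (the X-0-0 pattern); verify the output against the PV listing from Step 2 before relying on it:

# Print each ClickHouse node with the label value its shard requires
kubectl get pv -o custom-columns=NODE:.metadata.annotations.'local\.path\.provisioner/selected-node',CLAIM:.spec.claimRef.name --no-headers \
  | grep clickhouse | while read node claim; do
      shard=$(echo "$claim" | sed 's/.*-default-\([0-9]*\)-[0-9]*-[0-9]*$/\1/')
      echo "$node -> platform.tanzu.vmware.com/node=clickhouse-metrics-$shard"
    done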

 

Step 4: Check the current label on each node

Before applying labels, check what label (if any) already exists:

# kubectl get pv -o custom-columns=NAME:.metadata.name,NODE:.metadata.annotations.'local\.path\.provisioner/selected-node',CLAIM:.spec.claimRef.name --no-headers | grep clickhouse | awk '{print $2}' | sort -u | while read node; do echo "Node: $node - Label: $(kubectl get node $node -o jsonpath='{.metadata.labels.platform\.tanzu\.vmware\.com/node}')"; done

Node: ###.###.###.73 - Label: clickhouse-metrics-0
Node: ###.###.###.89 - Label: clickhouse-metrics-1
Node: ###.###.###.90 - Label: clickhouse-metrics-2

 

If the label is empty, it is not set. If a value is returned, note it for comparison against the expected values from Step 3.

 

Step 5: Apply the correct labels to each node

Based on the PV-to-node mapping from Step 2, apply the labels:

For shard 0 (claim ending in -0-0-0):

kubectl label node <node-name-for-shard-0> platform.tanzu.vmware.com/node=clickhouse-metrics-0 --overwrite

For shard 1 (claim ending in -1-0-0):

kubectl label node <node-name-for-shard-1> platform.tanzu.vmware.com/node=clickhouse-metrics-1 --overwrite

For shard 2 (claim ending in -2-0-0):

kubectl label node <node-name-for-shard-2> platform.tanzu.vmware.com/node=clickhouse-metrics-2 --overwrite

Note: The --overwrite flag ensures that if an incorrect label exists, it will be replaced with the correct value.
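
For deployments with more than three shards, the same labeling can be applied in a loop. This is an optional sketch that combines the PV-to-node mapping from Step 2 with the label commands above, again assuming a single replica per shard; prefix the kubectl label command with echo first to review the mapping before applying it:

# Label every ClickHouse node based on the shard index extracted from its PV claim name
kubectl get pv -o custom-columns=NODE:.metadata.annotations.'local\.path\.provisioner/selected-node',CLAIM:.spec.claimRef.name --no-headers \
  | grep clickhouse | while read node claim; do
      shard=$(echo "$claim" | sed 's/.*-default-\([0-9]*\)-[0-9]*-[0-9]*$/\1/')
      kubectl label node "$node" platform.tanzu.vmware.com/node=clickhouse-metrics-$shard --overwrite
    done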

 

Step 6: Verify the labels were applied correctly

# kubectl get nodes -o custom-columns=NAME:.metadata.name,CLICKHOUSE_LABEL:.metadata.labels.'platform\.tanzu\.vmware\.com/node' | grep clickhouse
###.###.###.73    clickhouse-metrics-0
###.###.###.89    clickhouse-metrics-1
###.###.###.90    clickhouse-metrics-2

 

Step 7: Restart pending ClickHouse pods (if needed)

If ClickHouse pods are still in Pending state, delete them to trigger rescheduling:

# List pending pods
kubectl get pods -n tanzusm | grep clickhouse

# Delete pending pods (StatefulSet will recreate them)
kubectl delete pod <pending-pod-name> -n tanzusm
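
Alternatively, if several ClickHouse pods are stuck, all Pending pods can be deleted in one command. A sketch that reuses the app=clickhouse-op label selector from the pod status check earlier in this article:

# Delete only Pending ClickHouse pods; the StatefulSet recreates them
kubectl delete pod -n tanzusm -l app=clickhouse-op --field-selector=status.phase=Pending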

 

Step 8: Verify ClickHouse pods are now running

kubectl get pods -n tanzusm -o wide | grep clickhouse

All pods should now be in Running state and scheduled on the correctly labeled nodes.
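
To confirm that each pod was scheduled onto the node holding its data, compare the pod-to-node placement against the label mapping from Step 6. A quick check, using the same label selector as the earlier pod status check:

# Show which node each ClickHouse pod landed on
kubectl get pods -n tanzusm -l app=clickhouse-op -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName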