An upgrade of the Hub from v10.3.4 to v10.3.5 fails. The initial failure logs point to an issue with setting labels on a ClickHouse node that no longer exists in the environment.
The console output typically displays the following error sequence:
[i] Setting labels on 1 nodes out of 1
[i] Setting node 192.168.100.60 index label clickhouse-metrics-0 from PVC data-volume-claim-chi-clickhouse-logs-default-0-0-0
[x] Could not perform installation step: patching node 192.168.100.60 failed: nodes "192.168.100.60" not found step="Set Clickhouse nodes labels."
Tanzu Hub (Upgrading from v10.3.4 to v10.3.5)
Hub deployed using the 'evaluation' sizing profile
While the initial error logs indicate a missing ClickHouse node, this error is a red herring: the issue is not caused by the ClickHouse server. The true root cause is resource exhaustion. Because the Hub is attached to a larger foundation, the downtime inherent in the upgrade process creates a significant backlog of requests. When the components restart after the downtime, they are hit with a large burst of incoming traffic.
Because the Hub was deployed using the low-resource 'evaluation' profile, the ensemble_observability_store pod does not have enough CPU/memory allocated to process this post-upgrade burst. The pod becomes overwhelmed and fails to handle the load, causing the broader upgrade installation steps to time out and fail (manifesting as the ClickHouse patching error).
To resolve the issue and allow the upgrade to complete, you must temporarily allocate enough resources to the ensemble_observability_store pod so it can process the accumulated backlog of requests.
Step 1: Temporarily Increase Pod Resources
First, pause reconciliation of the sm and ensemble-helm PackageInstalls so that the package manager does not revert your manual changes:

$ kubectl patch packageinstall/sm -n tanzusm -p '{"spec":{"paused":true}}' --type=merge
$ kubectl patch packageinstall/ensemble-helm -n tanzusm -p '{"spec":{"paused":true}}' --type=merge

Then manually edit the ensemble_observability_store deployment and increase its resource limits (CPU and memory) to match the larger 'foundation' profile sizing:

$ kubectl edit deployment ensemble-observability-store -n tanzusm
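As a rough sketch, the edit raises the resources stanza of the container spec. The values and container name below are illustrative assumptions only; substitute the actual 'foundation' profile sizing for your release:

```yaml
# Hypothetical example values -- not the official 'foundation' profile figures.
spec:
  template:
    spec:
      containers:
        - name: ensemble-observability-store   # container name may differ
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```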
Step 2: Allow the Backlog to Clear
Monitor the ensemble_observability_store and ClickHouse pods. Once the resources are increased, both pods should transition to a Running state and serve requests successfully. Wait for the queued requests that accumulated during the downtime to be fully processed.
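For example, the pods can be watched while the backlog drains (these commands are illustrative; `kubectl top` additionally requires metrics-server in the cluster):

```shell
# Watch pod status in the tanzusm namespace until the pods settle into Running
kubectl get pods -n tanzusm -w

# Optionally check CPU/memory consumption while the backlog is processed
kubectl top pods -n tanzusm
```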
Step 3: Revert to the Original Profile
Once the environment has settled and normal operations resume, scale the ensemble_observability_store deployment resources back down to the 'evaluation' profile limits, then resume reconciliation of the paused PackageInstalls:
$ kubectl patch packageinstall/sm -n tanzusm -p '{"spec":{"paused":false}}' --type=merge
$ kubectl patch packageinstall/ensemble-helm -n tanzusm -p '{"spec":{"paused":false}}' --type=merge
Future Upgrades: To prevent this issue from recurring during future patching or major version upgrades, we highly recommend permanently configuring the Hub with, or re-deploying it using, a larger sizing profile.