Hub Upgrade Fails: "Set Clickhouse nodes labels" Error due to Resource Exhaustion on ensemble_observability_store Pod
Article ID: 431563


Updated On:

Products

VMware Tanzu Platform - Hub

Issue/Introduction

During an upgrade of the Hub from v10.3.4 to v10.3.5, the upgrade process fails. The initial failure logs point to an issue with setting labels on a ClickHouse node that no longer exists in the environment.

The console output typically displays the following error sequence:

[i] Setting labels on 1 nodes out of 1  
[i] Setting node 192.168.100.60 index label clickhouse-metrics-0 from PVC data-volume-claim-chi-clickhouse-logs-default-0-0-0  
[x] Could not perform installation step: patching node 192.168.100.60 failed: nodes "192.168.100.60" not found step="Set Clickhouse nodes labels."

 

Environment

Tanzu Hub (Upgrading from v10.3.4 to v10.3.5)

Hub deployed using the 'evaluation' sizing profile

Cause

While the initial error logs indicate a missing ClickHouse node, this error is a red herring and the issue is not caused by the ClickHouse server. The true root cause is resource exhaustion. Because the Hub is attached to a larger foundation, the standard downtime required during the upgrade process creates a significant backlog of requests. When the components restart post-downtime, they are hit with a massive burst of incoming traffic.

Because the Hub was deployed using the low-resource 'evaluation' profile, the ensemble_observability_store pod does not have enough CPU/memory allocated to process this post-upgrade burst. The pod becomes overwhelmed and fails to handle the load, causing the broader upgrade installation steps to time out and fail (manifesting as the ClickHouse patching error).

Resolution

To resolve the issue and allow the upgrade to complete, you must temporarily allocate enough resources to the ensemble_observability_store pod so it can process the accumulated backlog of requests.

Step 1: Temporarily Increase Pod Resources

Manually edit the deployment for the ensemble_observability_store pod to increase its resource limits (CPU and memory) to match the larger 'foundation' profile sizing.

  • Pause the PackageInstall reconciliation
    $ kubectl patch packageinstall/sm -n tanzusm -p '{"spec":{"paused":true}}' --type=merge
    $ kubectl patch packageinstall/ensemble-helm -n tanzusm -p '{"spec":{"paused":true}}' --type=merge
  • Update resources in the ensemble-observability-store deployment
    $ kubectl edit deployment ensemble-observability-store -n tanzusm
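The exact CPU and memory values depend on your environment's 'foundation' profile sizing; consult the sizing documentation for the authoritative numbers. As an illustration only, the resources stanza you adjust inside the deployment's container spec looks like the following (the container name and all values below are placeholders, not official sizing):

```yaml
# Illustrative fragment of the deployment spec opened by 'kubectl edit'.
# Substitute the CPU/memory values published for the 'foundation' profile.
spec:
  template:
    spec:
      containers:
        - name: ensemble-observability-store   # container name may differ in your deployment
          resources:
            requests:
              cpu: "2"        # placeholder value
              memory: 4Gi     # placeholder value
            limits:
              cpu: "4"        # placeholder value
              memory: 8Gi     # placeholder value
```

Saving the edit causes Kubernetes to roll out a new pod with the increased allocation.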

Step 2: Allow the Backlog to Clear

Monitor the ensemble_observability_store and ClickHouse pods. Once the resources are increased, both pods should transition to a Running state and successfully serve requests. Wait for the burst of queued requests from the downtime to be fully processed.
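A sketch of how to watch the pods while the backlog clears, assuming the pod names contain "observability-store" and "clickhouse" as in the error output above:

```shell
# Watch the relevant pods until they settle into Running with stable restart counts
$ kubectl get pods -n tanzusm -w | grep -E 'observability-store|clickhouse'

# If metrics-server is available, confirm usage stays within the new limits
$ kubectl top pod -n tanzusm | grep observability-store
```

When the pods stay Running and resource usage levels off, the queued requests from the downtime window have been drained and the upgrade can proceed.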

Step 3: Revert to the Original Profile

Once the environment has settled and normal operations resume, manually scale the ensemble_observability_store pod resources back down to the 'evaluation' profile limits.

  • Unpause the PackageInstall reconciliation
    $ kubectl patch packageinstall/sm -n tanzusm -p '{"spec":{"paused":false}}' --type=merge
    $ kubectl patch packageinstall/ensemble-helm -n tanzusm -p '{"spec":{"paused":false}}' --type=merge
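After unpausing, you can confirm that reconciliation has resumed and the PackageInstalls return to a healthy state (the exact column output depends on your kapp-controller version):

```shell
# Both PackageInstalls should eventually report a successful reconcile
$ kubectl get packageinstall sm ensemble-helm -n tanzusm
```

If either PackageInstall remains in a failed state, inspect it with `kubectl describe packageinstall <name> -n tanzusm` before retrying the upgrade.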

Additional Information

Future Upgrades: To prevent this issue from recurring during future patching or major version upgrades, we strongly recommend re-deploying or permanently configuring the Hub with a larger sizing profile.

https://techdocs.broadcom.com/us/en/vmware-tanzu/platform/tanzu-hub/10-2/tnz-hub/install-planning.html#:~:text=with%20Tanzu%20Hub.-,Sizing%20guidelines,-Sizing%20is%20one