ClickHouse operator may stall after a Velero-based restore and leave the ClickHouseInstallation in InProgress

Article ID: 438113

Products

VMware Tanzu Application Catalog

Issue/Introduction

After a backup is restored with Velero, the ClickHouse operator pod may start before all required access control and dependent resources are fully restored. In this condition, the operator can log temporary authorization errors when it attempts to read or manage Kubernetes resources such as StatefulSets, ClickHouseInstallation (CHI) objects, and PersistentVolumeClaims. In some environments, the operator does not resume normal reconciliation automatically after those resources become available, and the CHI remains in InProgress until the operator pod is restarted.

Typical symptoms include:

  • the ClickHouse operator pod starts during restore
  • temporary authentication or authorization errors appear in the operator logs
  • supporting resources such as ServiceAccounts, Roles, and RoleBindings appear shortly afterward
  • the operator stops producing further reconciliation logs
  • the CHI remains in InProgress
  • restarting the operator pod allows reconciliation to continue and the restore to complete
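
The log-related symptoms above can be checked with a small script. The following is a minimal sketch, not a supported diagnostic: the function names are hypothetical, and the error pattern assumes that authorization failures in the operator log contain words such as "forbidden" or "unauthorized", which should be verified against the actual log output.

```python
import re

# Assumption: authorization failures are logged with words such as
# "forbidden" or "unauthorized". Adjust the pattern to match what the
# operator in your environment actually logs.
AUTH_ERROR = re.compile(r"forbidden|unauthorized", re.IGNORECASE)


def summarize_operator_log(lines):
    """Return (auth_error_count, lines_logged_after_last_auth_error)."""
    auth_errors = 0
    last_error_index = None
    for index, line in enumerate(lines):
        if AUTH_ERROR.search(line):
            auth_errors += 1
            last_error_index = index
    if last_error_index is None:
        return 0, len(lines)
    return auth_errors, len(lines) - last_error_index - 1


def matches_stall_pattern(lines, chi_status, max_trailing=0):
    """True when the CHI is InProgress and the log shows authorization
    errors followed by at most `max_trailing` further lines."""
    auth_errors, trailing = summarize_operator_log(lines)
    return (
        chi_status == "InProgress"
        and auth_errors > 0
        and trailing <= max_trailing
    )
```

The input would be the operator pod log split into lines, together with the status reported for the CHI. A positive result only indicates that the pattern described in this article is present; it does not prove the cause.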

Environment

This issue can be observed in environments that use:

  • Velero as the restore mechanism for Kubernetes resources, using Velero’s default restore order.
  • ClickHouse deployed through the ClickHouse Operator. Broadcom’s Bitnami/Tanzu ClickHouse Operator offering is based on the Altinity Kubernetes Operator for ClickHouse.
  • A ClickHouse Operator deployment that requires a ServiceAccount with RBAC privileges to create and manage Kubernetes objects on behalf of ClickHouse installations.

Cause

Velero restores resources one at a time and, by default, restores resources in a predefined order that includes CustomResourceDefinitions, Namespaces, PersistentVolumes, PersistentVolumeClaims, Secrets, ConfigMaps, and ServiceAccounts. Resources not explicitly listed in the restore priority sequence are appended afterward in alphabetical order unless the Velero server is configured with a custom --restore-resource-priorities value.
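
The ordering rule described above can be illustrated with a short sketch. This is a simplified model, not Velero's implementation: the function name is hypothetical, the priority list is only a subset of the documented default, and short resource names are used in place of fully qualified ones.

```python
def effective_restore_order(priorities, backed_up):
    """Model of the rule: prioritized resources first, in the listed
    order, then every other resource in alphabetical order."""
    present = set(backed_up)
    prioritized = [r for r in priorities if r in present]
    remaining = sorted(present - set(priorities))
    return prioritized + remaining


# Illustrative subset of the default priority list (simplified names).
priorities = [
    "customresourcedefinitions",
    "namespaces",
    "persistentvolumes",
    "persistentvolumeclaims",
    "secrets",
    "configmaps",
    "serviceaccounts",
    "pods",
]

# Hypothetical set of resource types contained in a backup.
backed_up = [
    "statefulsets",
    "roles",
    "rolebindings",
    "pods",
    "serviceaccounts",
    "namespaces",
    "customresourcedefinitions",
    "deployments",
]

order = effective_restore_order(priorities, backed_up)
for position, resource in enumerate(order, start=1):
    print(position, resource)
```

In this model, pods are restored before rolebindings and roles, because pods are in the priority list and the RBAC resources are appended alphabetically afterward.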

The ClickHouse Operator requires a ServiceAccount with RBAC privileges to watch, create, and destroy multiple Kubernetes objects. In the default order listed under Additional Information, Pods are part of the priority list, while Roles, RoleBindings, ClusterRoles, and ClusterRoleBindings are not, so they are restored later with the remaining resources. As a result, the operator can start during the restore window before all required permissions and related resources are available.

If the operator starts too early, it may encounter temporary authorization failures while Velero is still restoring the required objects. In some cases, the operator does not recover cleanly after those objects become available, and reconciliation remains stalled until the operator pod is restarted. This behavior is consistent with a restore sequencing problem affecting operator startup and reconciliation timing. 

Resolution

If the issue has already occurred and the CHI remains in InProgress, restart the ClickHouse operator, either by deleting the operator pod or by performing a rollout restart of the operator deployment. In affected environments, this reinitializes the operator after all restored resources are present and allows reconciliation to continue.
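
As a rough sketch of the restart step, the following helper builds and runs the corresponding kubectl commands. The helper names, the deployment name, and the namespace are placeholders, not values from this article; substitute the names used in your installation.

```python
import subprocess


def rollout_restart_command(deployment, namespace):
    """Build the kubectl command that restarts a deployment's pods."""
    return [
        "kubectl", "rollout", "restart",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]


def rollout_status_command(deployment, namespace, timeout="120s"):
    """Build the kubectl command that waits for the rollout to finish."""
    return [
        "kubectl", "rollout", "status",
        f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout}",
    ]


def restart_operator(deployment, namespace, run=subprocess.run):
    """Restart the operator deployment and wait for it to become ready.

    `run` is injectable so the function can be exercised without a cluster.
    """
    run(rollout_restart_command(deployment, namespace), check=True)
    run(rollout_status_command(deployment, namespace), check=True)
```

For example, restart_operator("clickhouse-operator", "clickhouse") would restart a deployment with that placeholder name in that placeholder namespace. Afterward, confirm that the CHI leaves InProgress.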

As a permanent solution, you can customize the Velero restore order. Velero supports custom restore ordering through the --restore-resource-priorities flag on the Velero server. This setting applies to future restores only. Resources not included in the custom list are appended afterward in alphabetical order.

Where operationally appropriate, adjust restore priorities so that the operator does not become runnable before its required resources, especially:

  • CustomResourceDefinitions
  • ServiceAccounts
  • Roles / ClusterRoles
  • RoleBindings / ClusterRoleBindings
  • Secrets
  • ConfigMaps
  • CHI-related custom resources and supporting objects

Because restore order is configured globally on the Velero server, evaluate the impact on other restored applications before changing it. Velero recommends using the default restore order unless customization is required.
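
One way to derive a custom priority value is sketched below, assuming the flag accepts a comma-separated list of resource names. The function names are hypothetical, and the starting list is an abbreviated illustration with simplified names; take the exact resource names and flag syntax from the Velero documentation for the version in use.

```python
RBAC_RESOURCES = (
    "clusterroles",
    "roles",
    "clusterrolebindings",
    "rolebindings",
)


def with_rbac_before(priorities, before="pods", rbac=RBAC_RESOURCES):
    """Return a copy of `priorities` with the RBAC resources placed
    immediately before `before` (or at the end if it is absent)."""
    result = [r for r in priorities if r not in rbac]
    index = result.index(before) if before in result else len(result)
    return result[:index] + list(rbac) + result[index:]


def priorities_flag(priorities):
    """Format the list as a Velero server argument."""
    return "--restore-resource-priorities=" + ",".join(priorities)


# Abbreviated, illustrative starting list (not the full default).
base = [
    "customresourcedefinitions",
    "namespaces",
    "secrets",
    "configmaps",
    "serviceaccounts",
    "pods",
    "replicasets",
]

custom = with_rbac_before(base)
print(priorities_flag(custom))
```

Placing the RBAC resources ahead of pods is the design choice that matters here: the operator pod then becomes runnable only after its permissions exist. Any resource left out of the list is still restored, but only after the prioritized ones.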

Additional Information

Velero documents the default restore order as:

  • Custom Resource Definitions
  • Namespaces
  • StorageClasses
  • VolumeSnapshotClass
  • VolumeSnapshotContents
  • VolumeSnapshots
  • PersistentVolumes
  • PersistentVolumeClaims
  • Secrets
  • ConfigMaps
  • ServiceAccounts
  • LimitRanges
  • Pods
  • ReplicaSets
  • Clusters
  • ClusterResourceSets

Relevant Velero documentation: https://velero.io/docs/main/restore-reference/#restore-order