Kubernetes Protection Backups Fail with "Storage bucket cannot be found" or "Unavailable" status in VMware Cloud Director Object Storage Extension.

Article ID: 427603

Products

VMware Cloud Director

Issue/Introduction

Backing up a Kubernetes cluster deployed through Container Service Extension (CSE) from the Object Storage Extension (OSE) UI fails. One or more of the following symptoms may be observed:

  • The Backup Storage Location status is shown as Unavailable.

  • The UI returns an error: Storage bucket cannot be found.

  • Velero pod logs indicate: failed to resolve service endpoint... A region must be set when sending requests to S3.

  • Manual attempts to list backups result in 404 Not Found or NoSuchKey for velero-backup.json.
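The first and third symptoms can be confirmed from the command line. The following is a minimal sketch, assuming Velero runs in the velero namespace (as in the patch command later in this article) and that its server Deployment is named velero, which is the Velero default but is not confirmed for OSE-managed installations. The function names are illustrative only.

```shell
# Assumption: Velero namespace is "velero"; override with VELERO_NS if it differs.
VELERO_NS="${VELERO_NS:-velero}"

# Show each Backup Storage Location and its phase (Available / Unavailable).
show_bsl_status() {
  kubectl get backupstoragelocations.velero.io -n "$VELERO_NS"
}

# Filter recent Velero server logs for the endpoint/region error.
# Assumption: the server Deployment is named "velero".
show_region_errors() {
  kubectl logs deployment/velero -n "$VELERO_NS" --tail=500 \
    | grep -iE 'region must be set|failed to resolve service endpoint'
}
```

Run show_bsl_status first, then show_region_errors. If the second command prints nothing, the region error is not present in the last 500 log lines.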

Environment

  • VMware Cloud Director: 10.6.1-24648072

  • VMware Object Storage Extension (OSE): 3.1.0-246734

  • Kubernetes Protection: Velero-based backup within OSE

  • Infrastructure: Clusters deployed via Container Service Extension (CSE)

Cause

The issue is typically caused by one or more of the following factors:

  1. Incorrect User Context: Kubernetes protection workflows are initiated as a Provider Administrator (e.g., cseadmin) instead of an Organization Administrator. The Provider account lacks the tenant-to-bucket mapping context, so S3 endpoint resolution fails.

  2. Node Scheduling Constraints (Taints): By default, Kubernetes control plane nodes are tainted to prevent workloads from running on them. The Velero node-agent DaemonSet does not tolerate these taints, so its pods are never scheduled on control plane nodes, which leads to failures when backing up the entire cluster.

  3. S3 Timeout: The default S3 request timeout may be insufficient for large metadata operations during cluster-wide backups.
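To check whether the second cause applies, compare the taints on each node with the nodes on which node-agent pods are actually running. This is a rough sketch; the function names are illustrative, and it assumes the velero namespace used elsewhere in this article.

```shell
# List every node together with the keys of its taints.
# Control plane nodes normally show node-role.kubernetes.io/control-plane.
list_node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# List node-agent pods with the node each one runs on (NODE column).
show_node_agent_pods() {
  kubectl get pods -n velero -o wide | grep node-agent
}
```

A tainted node that appears in the output of list_node_taints but has no node-agent pod in the output of show_node_agent_pods is a likely candidate for the scheduling problem described above.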

Resolution

To resolve these issues, follow the steps below:

1. Use Organization Administrator Credentials

Ensure all Kubernetes protection tasks (creating protection sets, initiating backups) are performed by a user with the Organization Administrator role within the specific tenant.

  • Log out of the Provider/System context.

  • Log in as the Tenant Org Admin and re-validate the Backup Storage Location.

2. Configure Velero Tolerations (Best Practice)

Do not remove taints from the control plane nodes, as this compromises cluster security. Instead, modify the Velero node-agent DaemonSet to tolerate the control plane taints.

Execute the following command to add the necessary tolerations:

Bash

    kubectl patch ds node-agent -n velero --type='json' -p='[
      {"op": "add", "path": "/spec/template/spec/tolerations/-",
       "value": {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists", "effect": "NoSchedule"}},
      {"op": "add", "path": "/spec/template/spec/tolerations/-",
       "value": {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}}
    ]'
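A JSON Patch "add" to a path ending in /- appends to an existing array, so the command above is expected to fail if the DaemonSet has no tolerations field at all. The sketch below shows one way to detect that case, create the array with a merge patch instead, and confirm the rollout. The function and variable names are illustrative, not part of Velero or OSE. Note that the merge patch replaces the whole tolerations list, so it should only be used when no tolerations exist yet.

```shell
# Succeeds when the node-agent DaemonSet already defines a tolerations array.
has_tolerations() {
  [ -n "$(kubectl get ds node-agent -n velero \
            -o jsonpath='{.spec.template.spec.tolerations}')" ]
}

# Merge patch body that sets both control plane tolerations at once.
TOLERATIONS_JSON='{"spec":{"template":{"spec":{"tolerations":[
  {"key":"node-role.kubernetes.io/control-plane","operator":"Exists","effect":"NoSchedule"},
  {"key":"node-role.kubernetes.io/master","operator":"Exists","effect":"NoSchedule"}
]}}}}'

# Create the tolerations array (replaces any existing list).
create_tolerations() {
  kubectl patch ds node-agent -n velero --type=merge -p "$TOLERATIONS_JSON"
}

# Wait for the DaemonSet rollout, then show where node-agent pods landed.
verify_node_agent() {
  kubectl rollout status ds/node-agent -n velero --timeout=120s
  kubectl get pods -n velero -o wide | grep node-agent
}
```

A typical sequence would be: run has_tolerations || create_tolerations when the JSON patch is rejected, then verify_node_agent to check that a node-agent pod is now running on each control plane node.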

3. Adjust S3 Request Expiry Time

If timeouts persist, increase the S3 request expiration time within the VCD configuration:

  1. Access the OSE/VCD configuration settings.

  2. Set the following parameter:

    oss.s3.request-expire-time=3600

  3. Restart the OSE services to apply changes.