Backups of a Kubernetes cluster deployed via Container Service Extension (CSE) fail when initiated from the Object Storage Extension (OSE) UI. Users may observe the following symptoms:
The Backup Storage Location status is shown as Unavailable.
The UI returns an error: Storage bucket cannot be found.
Velero pod logs indicate: failed to resolve service endpoint... A region must be set when sending requests to S3.
Manual attempts to list backups result in 404 Not Found or NoSuchKey for velero-backup.json.
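The symptoms above can be confirmed from the command line before touching any configuration. A minimal diagnostic sketch, assuming the Velero CLI is installed and kubectl has access to the cluster (the namespace `velero` is the default install location and may differ in your deployment):

```shell
# Check the Backup Storage Location status (should be "Available";
# an "Unavailable" phase matches the symptom above)
velero backup-location get

# Inspect the Velero server logs for the S3 endpoint/region error
kubectl -n velero logs deploy/velero | grep -i "region"
```

If the logs show "A region must be set when sending requests to S3", proceed with the cause analysis below.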
VMware Cloud Director (VCD): 10.6.1-24648072
VMware Object Storage Extension (OSE): 3.1.0-246734
Kubernetes Protection: Velero-based backup within OSE
Infrastructure: Clusters deployed via Container Service Extension (CSE)
The issue is typically caused by one or more of the following factors:
Incorrect User Context: Initiating Kubernetes protection workflows using a Provider Administrator (e.g., cseadmin) instead of an Organization Administrator. The Provider account does not have the correct tenant-to-bucket mapping context, causing the S3 endpoint resolution to fail.
Node Scheduling Constraints (Taints): By default, Kubernetes control plane nodes are tainted to prevent workloads from running on them. The Velero node-agent (DaemonSet) cannot schedule pods on these nodes, leading to failures when backing up the entire cluster.
S3 Timeout: The default S3 request timeout may be insufficient for large metadata operations during cluster-wide backups.
To resolve these issues, follow the steps below:
Ensure all Kubernetes protection tasks (creating protection sets, initiating backups) are performed by a user with the Organization Administrator role within the specific tenant.
Log out of the Provider/System context.
Log in as the Tenant Org Admin and re-validate the Backup Storage Location.
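After re-validating as the Tenant Org Admin, the Backup Storage Location can also be verified directly against the cluster. A sketch, assuming the default BSL name `default` in the `velero` namespace (both may differ in your environment):

```shell
# The phase should report "Available" once the tenant context is correct
kubectl -n velero get backupstoragelocation default \
  -o jsonpath='{.status.phase}'
```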
Do not remove taints from the control plane nodes, as this compromises cluster security. Instead, modify the Velero node-agent DaemonSet to tolerate the control plane taints.
Execute the following command to add the necessary tolerations:
kubectl patch ds node-agent -n velero --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "node-role.kubernetes.io/control-plane", "operator": "Exists", "effect": "NoSchedule"}},
  {"op": "add", "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}}
]'
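After patching, it is worth confirming that the tolerations landed and that the node-agent pods now schedule onto the control plane nodes. A verification sketch (the `name=node-agent` label selector is the Velero default and is an assumption about your install):

```shell
# Show the tolerations now present on the DaemonSet pod template
kubectl -n velero get ds node-agent \
  -o jsonpath='{.spec.template.spec.tolerations}'

# node-agent pods should appear on every node, including control plane
# nodes; the label selector below assumes the default Velero labels
kubectl -n velero get pods -l name=node-agent -o wide
```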
If timeouts persist, increase the S3 request expiration time in the OSE configuration:
Access the OSE/VCD configuration settings.
Set the following parameter:
oss.s3.request-expire-time=3600
Restart the OSE services to apply changes.
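The configuration steps above can be sketched from the OSE host's command line. This assumes the `ose` administrative CLI is available there; the exact argument syntax varies between OSE releases, so verify against the documentation for your version:

```shell
# Assumption: the `ose args set` syntax below matches your OSE release
ose args set --k oss.s3.request-expire-time --v 3600

# Restart the OSE service so the new value takes effect
ose service restart
```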