Velero Fails to Start After TKR 1.28 Upgrade Due to PSA Enforcement and CPU Constraints



Article ID: 400901


Updated On:

Products

Tanzu Kubernetes Runtime

Issue/Introduction

After upgrading both Velero and the Tanzu Kubernetes Release (TKR) to version 1.28 across TKGs (VKS) clusters running on Supervisor 7, Velero failed to start. This caused a complete outage of Velero’s backup functionality across all impacted clusters.

The root cause was Pod Security Admission (PSA) enforcement at the restricted level, which blocked Velero’s ReplicaSet from creating pods because required securityContext fields were missing. A second issue, CPU resource exhaustion, prevented the Velero pod from scheduling even after the PSA configuration was corrected.
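On an affected cluster, the symptom typically appears as a Velero deployment with no available replicas, while its ReplicaSet reports FailedCreate events for the blocked pods. Assuming Velero is installed in the velero namespace, this can be confirmed with:

kubectl get deployment,replicaset,pods -n velero

kubectl get events -n velero --sort-by=.lastTimestamp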

Environment

Tanzu Kubernetes Runtime

Cause

Two root causes were identified:

  1. PSA Enforcement at the Restricted Level
     The upgraded TKR enables stricter PSA policies (restricted level by default), and the Velero deployment did not conform to them. Specific violations included:
       • allowPrivilegeEscalation was not explicitly set to false
       • Containers did not drop all capabilities
       • runAsNonRoot was not set to true
       • seccompProfile.type was missing
     A compliant securityContext is sketched after this list.
  2. Cluster CPU Resource Exhaustion
     In one cluster, even after PSA was addressed, the Velero pod remained unscheduled because it requested 500m CPU and all available worker nodes were overcommitted.
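
For reference, the restricted profile requires each of the fields listed above to be set explicitly. A minimal, illustrative sketch of a compliant pod spec fragment (the container name velero is assumed here; this is not Velero's shipped manifest) would look like:

    securityContext:
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
    containers:
      - name: velero
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL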

Resolution

Step 1: Lower PSA Enforcement Level

To allow Velero to start, reduce the PSA level from restricted to baseline in the Velero namespace:

kubectl label ns velero pod-security.kubernetes.io/enforce=baseline --overwrite
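
To confirm that the label was applied:

kubectl get ns velero --show-labels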

Then restart the Velero deployment:

kubectl rollout restart deployment -n velero
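
Verify that the ReplicaSet can now create the pod and that it reaches the Running state:

kubectl get pods -n velero

If no pod appears, describe the ReplicaSet and check its events for any remaining admission violations:

kubectl describe rs -n velero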

 

Step 2: Resolve CPU Resource Constraints (if applicable)

If the Velero pod still fails to schedule because of CPU exhaustion:

  • Edit the Velero deployment and reduce the CPU request from 500m to 200m, or to another value the nodes can accommodate, as shown in the example below.
  • Save the changes and wait for the pod to be scheduled.
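
As an example, the request can be lowered in a single step with kubectl set resources. The deployment name velero is assumed here and may differ in your installation:

kubectl set resources deployment velero -n velero --requests=cpu=200m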

You can assess node capacity using the following:

kubectl describe nodes | grep -A5 "Allocatable"
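
Because the issue is overcommitment rather than total capacity, it also helps to compare the CPU already requested on each node against its allocatable amount:

kubectl describe nodes | grep -A8 "Allocated resources"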

Or use the TMC UI:

Clusters > [cluster name] > Nodes

Depending on findings, consider rebalancing workloads or scaling out the cluster.